Posted to user@cassandra.apache.org by Markus Klems <ma...@gmail.com> on 2012/10/18 19:33:56 UTC

What does ReadRepair exactly do?

Hi guys,

I am looking through the Cassandra source code in the github trunk to
better understand how Cassandra's fault-tolerance mechanisms work. Most
things make sense. I am also aware of the wiki and DataStax documentation.
However, I do not understand what read repair does in detail. The method
RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to
merge conflicting versions of column family replicas and build the set of
columns that need to be "repaired". From looking at the source code, I do
not understand how this set is built or how the reconciliation is executed.
ReadRepair does not seem to trigger a Column.reconcile() to reconcile
conflicting column versions on different servers. Does it?

If this is not what read repair does, then: What kind of inconsistencies
are resolved by read repair? And: How are the inconsistencies resolved?

Could someone give me a hint?

Thanks so much,

-Markus

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

I think so. Otherwise, we may never complete a read if writes come in
continuously.

On Wed, Oct 24, 2012 at 9:04 AM, shankarpnsn <sh...@gmail.com> wrote:

> manuzhang wrote
> > why repair again? We block until the consistency constraint is met. Then
> > the latest version is returned and repair is done asynchronously if any
> > mismatch. We may retry read if fewer columns than required are returned.
>
> Just to make sure I understand you correctly, consider the case when a read
> repair is in flight and a subsequent write affects one or more of the
> replicas that were scheduled to receive the repair mutations. In this case,
> are you saying that we return the older version to the user rather than the
> latest version that was effected by the write?
>
>
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

Oh, it would clarify a lot if you read the source code; the method is
o.a.c.service.StorageProxy.fetchRows, if I remember correctly.

On Wed, Oct 24, 2012 at 10:26 PM, Manu Zhang <ow...@gmail.com>wrote:

> And we don't send read requests to all three replicas (R1, R2, R3)
> if CL=QUORUM; just 2 of them, depending on proximity
>
>
> On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean <De...@nrel.gov>wrote:
>
>> The user will meet the required consistency unless you encounter some kind
>> of bug in Cassandra.  You will either get the older value or the newer
>> value. If you read at quorum, and maybe a write at CL=1 just happened, you
>> may get the older or the newer value depending on whether the node that
>> received the write was involved.  If you read at quorum and you wrote at
>> CL=QUORUM, then you may get the newer value or the older value depending on
>> who gets there first, so to speak.
>>
>> In your scenario, if the read repair read from R2 just before the write is
>> applied, you get the old value.  If it read from R2 just after the write
>> was applied, it gets the new value.  BOTH of these met the consistency
>> constraint.  A better example to clear this up may be the following...  If
>> you read a value at CL=QUORUM, and you have a write 20ms later, you get
>> the old value, right?  And it met the consistency level, right?  NOW, what
>> about if the write is 1ms later?  What if the write is .00001ms later?
>> It still met the consistency level, right?  If it is .00001ms before, you
>> get the new value as it repairs first with the new node.
>>
>> It is just that when programming, your read may get the newer value or the
>> older value, and generally if you write the code in a way that works with
>> that, this concept works out great in most cases (in some cases, you need
>> to think a bit differently and solve it other ways).
>>
>> I hope that clears it up
>>
>> Later,
>> Dean
>>
>> On 10/24/12 8:02 AM, "shankarpnsn" <sh...@gmail.com> wrote:
>>
>> >Hiller, Dean wrote
>> >> in general it is okay to get the older or newer value.  If you are
>> >> reading 2 rows however instead of one, that may change.
>> >
>> >This is certainly interesting, as it could mean that the user could see a
>> >value that never met the required consistency. For instance with 3 replicas
>> ><R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
>> >(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
>> >recent value) and initiates a read repair with its value. Meanwhile R2 and
>> >R3 have seen two different writes with newer values than what was computed
>> >by the read repair. If R1 were to respond back to the user with the value
>> >that was computed at the time of read repair, wouldn't it be a value that
>> >never met the consistency constraint? I was thinking if this should trigger
>> >another round of repair that tries to reach the consistency constraint with
>> >a newer value or time-out, which is the expected case when you don't meet
>> >the required consistency. Please let me know if I'm missing something here.
>> >
>> >
>> >
>> >--
>> >View this message in context:
>> >http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583366.html
>> >Sent from the cassandra-user@incubator.apache.org mailing list archive at
>> >Nabble.com.
>>
>>
>

Re: What does ReadRepair exactly do?

Posted by aaron morton <aa...@thelastpickle.com>.

>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
This is correct.
It's not that a quorum of nodes agree; it's that a quorum of nodes participate. If a quorum participates in both the write and the read, you are guaranteed that at least one node was involved in both. The Wikipedia definition helps here: "A quorum is the minimum number of members of a deliberative assembly necessary to conduct the business of that group" http://en.wikipedia.org/wiki/Quorum

It's a two-step process. First, do we have enough people to make a decision? Second, following the rules, what was the decision?

In C* the rule is to use the value with the highest timestamp, not the value with the highest number of "votes". The red boxes on this slide are the winning values: http://www.slideshare.net/aaronmorton/cassandra-does-what-code-mania-2012/67 (I think one of my slides in that deck may have been misleading in the past). In Riak the rule is to use Vector Clocks.

So 
> I agree that returning val4 is the right thing to do if quorum (two) nodes
> among (node1,node2,node3) have the val4
Is incorrect.
We return the value with the highest time stamp returned from the nodes involved in the read. Only one needs to have val4. 
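
As a rough illustration of that rule (a sketch only, not the actual Cassandra code; the Versioned type and resolve() are hypothetical names, and tie-breaking between equal timestamps is left out):

```python
from typing import Iterable, NamedTuple


class Versioned(NamedTuple):
    """Hypothetical stand-in for one replica's answer for a column."""
    value: str
    timestamp: int


def resolve(responses: Iterable[Versioned]) -> Versioned:
    """Pick the response with the highest timestamp; vote counts are ignored."""
    return max(responses, key=lambda v: v.timestamp)


# Two replicas answer val1 (timestamp 1), only one answers val4 (timestamp 4).
responses = [Versioned("val1", 1), Versioned("val4", 4), Versioned("val1", 1)]
winner = resolve(responses)
print(winner.value)
```

Here val4 wins even though it is outnumbered two to one, because only the timestamp is compared.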

> The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment is issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint. 
and
> My intuition behind saying this is because we
> would respond to the client without the replicas having confirmed their
> meeting the consistency requirement.

It is not necessary for the coordinator to wait. 

Consider an example: The app has stopped writing to the cluster. For a certain column, nodes 1, 2 and 3 have value:timestamp bar:2, bar:2 and foo:1 respectively. The last write was a successful CL QUORUM write of bar with timestamp 2. However, node 3 did not acknowledge this write for some reason.

To make it interesting, the commit log volume on node 3 is full. Mutations are blocking in the commit log queue, so any write on node 3 will time out and fail, but reads are still working. We could imagine this is why node 3 did not commit bar:2.

Some read examples, RR is not active:

1) Client reads from node 4 (a non-replica) with CL QUORUM; the request goes to nodes 1 and 2. Both agree on bar as the value.
2) Client reads from node 3 with CL QUORUM; the request is processed locally and on node 2.
	* There is a digest mismatch
	* Row Repair read runs to read from nodes 2 and 3.
	* The super set resolves to bar:2
	* Node 3 (the coordinator) queues a delta write locally to write bar:2. No other delta writes are sent.
	* Node 3 returns bar:2 to the client
3) Client reads from node 3 at CL QUORUM. The same thing as (2) happens and bar:2 is returned.
4) Client reads from node 2 at CL QUORUM; the read goes to nodes 2 and 3. Roughly the same thing as (2) happens and bar:2 is returned.
5) Client reads from node 1 at CL ONE. Read happens locally only and returns bar:2
6) Client reads from node 3 at CL ONE. Read happens locally only and returns foo:1

So:
* A read at CL QUORUM will always return bar:2 even if node 3 only has foo:1 on disk.
* A read at CL ONE will return no value or any previous write.
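
The outcomes above can be checked with a small model. This is only a sketch under simplifying assumptions: REPLICAS and read() are made-up names, a read just takes the highest-timestamp value among the contacted replicas, and no repair writes are modelled.

```python
from itertools import combinations

# Hypothetical model of the example: replica id -> (value, timestamp).
REPLICAS = {1: ("bar", 2), 2: ("bar", 2), 3: ("foo", 1)}


def read(contacted):
    """Return the highest-timestamp value among the contacted replicas."""
    return max((REPLICAS[n] for n in contacted), key=lambda vt: vt[1])[0]


# CL QUORUM with 3 replicas: every possible pair of replicas.
quorum_results = {pair: read(pair) for pair in combinations(REPLICAS, 2)}

# CL ONE: a single replica answers.
one_results = {n: read([n]) for n in REPLICAS}

print(quorum_results)
print(one_results)
```

Every pair contains node 1 or node 2, so every QUORUM read resolves to bar, while the CL ONE read served by node 3 alone sees foo.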

The delta write from the Row Repair goes to a single node, so R + W > N cannot be applied. It can almost be thought of as an internal implementation detail. The delta write from a Digest Mismatch, HH writes, full RR writes and nodetool repair are used to:

* Reduce the chance of a Digest Mismatch when CL > ONE
* Eventually reach a state where reads at any CL return the last write. 

They are not used to ensure strong consistency when R + W > N. You could turn those things off and R + W > N would still work. 
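
The R + W > N guarantee itself is plain set overlap, which a brute-force check makes visible. This is a sketch only; always_overlap() is a hypothetical helper, not anything in Cassandra.

```python
from itertools import combinations


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas intersects every set of
    r read replicas, out of n replicas in total."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


print(always_overlap(3, 2, 2))  # QUORUM write, QUORUM read: R + W > N
print(always_overlap(3, 1, 2))  # ONE write, QUORUM read: R + W = N
```

With N=3 and QUORUM on both sides, at least one replica in any read also took part in the write, with no repair mechanism involved. With a CL ONE write the read can miss the single replica that has the value.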
 
Hope that helps. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 7:15 AM, shankarpnsn <sh...@gmail.com> wrote:

> manuzhang wrote
>> read quorum doesn't mean we read newest values from a quorum number of
>> replicas but to ensure we read at least one newest value as long as write
>> quorum succeeded beforehand and W+R > N.
> 
> I beg to differ here. Any read/write, by definition of quorum, should have
> at least n/2 + 1 replicas that agree on that read/write value. Responding to
> the user with a newer value, even if the write creating the new value hasn't
> completed cannot guarantee any read consistency > 1. 
> 
> 
> Hiller, Dean wrote
>>> Kind of an interesting question
>>> 
>>> I think you are saying if a client read resolved only the two nodes as
>>> said in Aaron's email back to the client and read -repair was kicked off
>>> because of the inconsistent values and the write did not complete yet and
>>> I guess you would have two nodes go down to lose the value right after
>>> the
>>> read, and before write was finished such that the client read a value
>>> that
>>> was never stored in the database.  The odds of two nodes going out are
>>> pretty slim though.
>>> Thanks,
>>> Dean
> 
> Bingo! I do understand that the odds of a quorum nodes going down are low
> and that any subsequent read would achieve a quorum. However, I'm wondering
> what would be the right thing to do here, given that the client has
> particularly asked for a certain consistency on the read and cassandra
> returns a value that doesn't have the consistency. The heart of the problem
> here is that the coordinator responds to a client request "assuming" that
> the consistency has been achieved the moment is issues a row repair with the
> super-set of the resolved value; without receiving acknowledgement on the
> success of a repair from the replicas for a given consistency constraint. 
> 
> In order to adhere to the given consistency specification, the row repair
> (due to consistent reads) should repeat the read after issuing a
> "consistency repair" to ensure if the consistency is met. Like Manu
> mentioned, this could of course lead to a number of repeat reads if the
> writes arrive quickly - until the read gets timed out. However, note that we
> would still be honoring the consistency constraint for that read. 
> 
> 
> 
> --
> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

manuzhang wrote
> read quorum doesn't mean we read newest values from a quorum number of
> replicas but to ensure we read at least one newest value as long as write
> quorum succeeded beforehand and W+R > N.

I beg to differ here. Any read/write, by definition of quorum, should have
at least n/2 + 1 replicas that agree on that read/write value. Responding to
the user with a newer value, even if the write creating the new value hasn't
completed cannot guarantee any read consistency > 1. 


Hiller, Dean wrote
>> Kind of an interesting question
>>
>> I think you are saying if a client read resolved only the two nodes as
>> said in Aaron's email back to the client and read -repair was kicked off
>> because of the inconsistent values and the write did not complete yet and
>> I guess you would have two nodes go down to lose the value right after
>> the
>> read, and before write was finished such that the client read a value
>> that
>> was never stored in the database.  The odds of two nodes going out are
>> pretty slim though.
>> Thanks,
>> Dean

Bingo! I do understand that the odds of a quorum of nodes going down are low
and that any subsequent read would achieve a quorum. However, I'm wondering
what would be the right thing to do here, given that the client has
specifically asked for a certain consistency on the read and Cassandra
returns a value that doesn't have that consistency. The heart of the problem
here is that the coordinator responds to a client request "assuming" that
the consistency has been achieved the moment it issues a row repair with the
super-set of the resolved value, without receiving acknowledgement of the
success of the repair from the replicas for the given consistency constraint.

In order to adhere to the given consistency specification, the row repair
(due to consistent reads) should repeat the read after issuing a
"consistency repair" to check that the consistency is met. Like Manu
mentioned, this could of course lead to a number of repeat reads if the
writes arrive quickly - until the read times out. However, note that we
would still be honoring the consistency constraint for that read.



--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583400.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

Read quorum doesn't mean we read the newest value from a quorum of
replicas; it ensures that we read at least one copy of the newest value, as
long as a write quorum succeeded beforehand and W + R > N.

On Fri, Oct 26, 2012 at 12:00 AM, Hiller, Dean <De...@nrel.gov> wrote:

> Kind of an interesting question
>
> I think you are saying if a client read resolved only the two nodes as
> said in Aaron's email back to the client and read -repair was kicked off
> because of the inconsistent values and the write did not complete yet and
> I guess you would have two nodes go down to lose the value right after the
> read, and before write was finished such that the client read a value that
> was never stored in the database.  The odds of two nodes going out are
> pretty slim though.
>
> Or, what if the node with part of the write went down, as long as the
> client stays up, he would complete his write on the other two nodes.
> Seems to me as long as two nodes don't fail, you are reading at quorum and
> fit with the consistency model since you get a value that will be on two
> nodes in the immediate future.
>
> Thanks,
> Dean
>
> On 10/25/12 9:45 AM, "shankarpnsn" <sh...@gmail.com> wrote:
>
> >aaron morton wrote
> >>> 2. You do a write operation (W1) with quorom of val=2
> >>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete
> >>>yet)
> >> If the write has not completed then it is not a successful write at the
> >> specified CL as it could fail now.
> >>
> >> Therefor the R +W > N Strong Consistency guarantee does not apply at
> >>this
> >> exact point in time. A read to the cluster at this exact point in time
> >> using QUOURM may return val2 or val1. Again the operation W1 has not
> >> completed, if read R' starts and completes while W1 is processing it may
> >> or may not return the result of W1.
> >
> >I agree completely that it is fair to have this indeterminism in case of
> >partial/failed/in-flight writes, based on what nodes respond to a
> >subsequent
> >read.
> >
> >
> >aaron morton wrote
> >> It's import to point out the difference between Read Repair, in the
> >> context of the read_repair_chance setting, and Consistent Reads in the
> >> context of the CL setting. All of this is outside of the processing of
> >> your read request. It is separate from the stuff below.
> >>
> >> Inside the user read request when ReadCallback.get() is called and CL
> >> nodes have responded the responses are compared. If a DigestMismatch
> >> happens then a Row Repair read is started, the result of this read is
> >> returned to the user. This Row Repair read MAY detect differences, if it
> >> does it resolves the super set, sends the delta to the replicas and
> >> returns the super set value to be returned to the client.
> >>
> >>> In this case, for read R1, the value val2 does not have a quorum. Would
> >>> read
> >>> R1 return val2 or val4 ?
> >>
> >> If val4 is in the memtable on node before the second read the result
> >>will
> >> be val4.
> >> Writes that happen between the initial read and the second read after a
> >> Digest Mismatch are included in the read result.
> >
> >Thanks for clarifying this, Aaron. This is very much in line with what I
> >figured out from the code and brings me back to my initial question on the
> >point of when and what the user/client gets to see as the read result. Let
> >us, for now, consider only the repairs initiated as a part of /consistent
> >reads/. If the Row Repair (after resolving and sending the deltas to
> >replicas, but not waiting for a quorum success after the repair) returns
> >the
> >super set value immediately to the user, wouldn't it be a breach of the
> >consistent reads paradigm? My intuition behind saying this is because we
> >would respond to the client without the replicas having confirmed their
> >meeting the consistency requirement.
> >
> >I agree that returning val4 is the right thing to do if quorum (two) nodes
> >among (node1,node2,node3) have the val4 at the second read after digest
> >mismatch. But wouldn't it be incorrect to respond to user with any value
> >when the second read (after mismatch) doesn't find a quorum. So after
> >sending the deltas to the replicas as a part of the repair (still a part
> >of
> >/consistent reads/), shouldn't the value be read again to check for the
> >presence of a quorum after the repair?
> >
> >In the example we had, assume the mismatch is detected during a read R1
> >from
> >coordinator node C, that reaches node1, node2
> >State seen by C after first read R1:  <node1 = val1, node2 = val 2, node3
> >=
> >val1>
> >
> >A second read is initiated as a part of repair for consistent read of R1.
> >This second read observes the values (val1, val2) from (node1, node2) and
> >sends the corresponding row repair delta to node1. I'm guessing C cannot
> >respond back to user with val2 until C knows that node1 has actually
> >written
> >the value val2 thereby meeting the quorum. Is this interpretation correct
> >?
> >
> >
> >
> >
> >
> >
> >--
> >View this message in context:
> >http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583395.html
> >Sent from the cassandra-user@incubator.apache.org mailing list archive at
> >Nabble.com.
>
>

Re: What does ReadRepair exactly do?

Posted by "Hiller, Dean" <De...@nrel.gov>.

Kind of an interesting question.

I think you are saying: suppose a client read resolved only the two nodes,
as described in Aaron's email, and returned to the client; read-repair was
kicked off because of the inconsistent values; and the write did not
complete yet. Then two nodes would have to go down right after the read,
and before the write finished, to lose the value, such that the client read
a value that was never stored in the database.  The odds of two nodes going
out are pretty slim though.

Or, what if the node with part of the write went down? As long as the
client stays up, it would complete its write on the other two nodes.
It seems to me that as long as two nodes don't fail, you are reading at
quorum and fit the consistency model, since you get a value that will be on
two nodes in the immediate future.

Thanks,
Dean

On 10/25/12 9:45 AM, "shankarpnsn" <sh...@gmail.com> wrote:

>aaron morton wrote
>>> 2. You do a write operation (W1) with quorom of val=2
>>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete
>>>yet)
>> If the write has not completed then it is not a successful write at the
>> specified CL as it could fail now.
>> 
>> Therefor the R +W > N Strong Consistency guarantee does not apply at
>>this
>> exact point in time. A read to the cluster at this exact point in time
>> using QUOURM may return val2 or val1. Again the operation W1 has not
>> completed, if read R' starts and completes while W1 is processing it may
>> or may not return the result of W1.
>
>I agree completely that it is fair to have this indeterminism in case of
>partial/failed/in-flight writes, based on what nodes respond to a
>subsequent
>read. 
>
>
>aaron morton wrote
>> It's import to point out the difference between Read Repair, in the
>> context of the read_repair_chance setting, and Consistent Reads in the
>> context of the CL setting. All of this is outside of the processing of
>> your read request. It is separate from the stuff below.
>> 
>> Inside the user read request when ReadCallback.get() is called and CL
>> nodes have responded the responses are compared. If a DigestMismatch
>> happens then a Row Repair read is started, the result of this read is
>> returned to the user. This Row Repair read MAY detect differences, if it
>> does it resolves the super set, sends the delta to the replicas and
>> returns the super set value to be returned to the client.
>> 
>>> In this case, for read R1, the value val2 does not have a quorum. Would
>>> read
>>> R1 return val2 or val4 ?
>> 
>> If val4 is in the memtable on node before the second read the result
>>will
>> be val4.  
>> Writes that happen between the initial read and the second read after a
>> Digest Mismatch are included in the read result.
>
>Thanks for clarifying this, Aaron. This is very much in line with what I
>figured out from the code and brings me back to my initial question on the
>point of when and what the user/client gets to see as the read result. Let
>us, for now, consider only the repairs initiated as a part of /consistent
>reads/. If the Row Repair (after resolving and sending the deltas to
>replicas, but not waiting for a quorum success after the repair) returns
>the
>super set value immediately to the user, wouldn't it be a breach of the
>consistent reads paradigm? My intuition behind saying this is because we
>would respond to the client without the replicas having confirmed their
>meeting the consistency requirement.
>
>I agree that returning val4 is the right thing to do if quorum (two) nodes
>among (node1,node2,node3) have the val4 at the second read after digest
>mismatch. But wouldn't it be incorrect to respond to user with any value
>when the second read (after mismatch) doesn't find a quorum. So after
>sending the deltas to the replicas as a part of the repair (still a part
>of
>/consistent reads/), shouldn't the value be read again to check for the
>presence of a quorum after the repair?
>
>In the example we had, assume the mismatch is detected during a read R1
>from
>coordinator node C, that reaches node1, node2
>State seen by C after first read R1:  <node1 = val1, node2 = val 2, node3
>=
>val1>
>
>A second read is initiated as a part of repair for consistent read of R1.
>This second read observes the values (val1, val2) from (node1, node2) and
>sends the corresponding row repair delta to node1. I'm guessing C cannot
>respond back to user with val2 until C knows that node1 has actually
>written
>the value val2 thereby meeting the quorum. Is this interpretation correct
>?
>
>
>
>
>
>
>--
>View this message in context:
>http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583395.html
>Sent from the cassandra-user@incubator.apache.org mailing list archive at
>Nabble.com.

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

aaron morton wrote
>> 2. You do a write operation (W1) with quorom of val=2
>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
> If the write has not completed then it is not a successful write at the
> specified CL as it could fail now.
> 
> Therefor the R +W > N Strong Consistency guarantee does not apply at this
> exact point in time. A read to the cluster at this exact point in time
> using QUOURM may return val2 or val1. Again the operation W1 has not
> completed, if read R' starts and completes while W1 is processing it may
> or may not return the result of W1.

I agree completely that it is fair to have this indeterminism in case of
partial/failed/in-flight writes, based on what nodes respond to a subsequent
read. 


aaron morton wrote
> It's import to point out the difference between Read Repair, in the
> context of the read_repair_chance setting, and Consistent Reads in the
> context of the CL setting. All of this is outside of the processing of
> your read request. It is separate from the stuff below.
> 
> Inside the user read request when ReadCallback.get() is called and CL
> nodes have responded the responses are compared. If a DigestMismatch
> happens then a Row Repair read is started, the result of this read is
> returned to the user. This Row Repair read MAY detect differences, if it
> does it resolves the super set, sends the delta to the replicas and
> returns the super set value to be returned to the client. 
> 
>> In this case, for read R1, the value val2 does not have a quorum. Would
>> read
>> R1 return val2 or val4 ? 
> 
> If val4 is in the memtable on node before the second read the result will
> be val4.  
> Writes that happen between the initial read and the second read after a
> Digest Mismatch are included in the read result.

Thanks for clarifying this, Aaron. This is very much in line with what I
figured out from the code and brings me back to my initial question on the
point of when and what the user/client gets to see as the read result. Let
us, for now, consider only the repairs initiated as a part of /consistent
reads/. If the Row Repair (after resolving and sending the deltas to
replicas, but not waiting for a quorum success after the repair) returns the
super set value immediately to the user, wouldn't it be a breach of the
consistent reads paradigm? My intuition behind saying this is because we
would respond to the client without the replicas having confirmed their
meeting the consistency requirement.

I agree that returning val4 is the right thing to do if quorum (two) nodes
among (node1,node2,node3) have the val4 at the second read after digest
mismatch. But wouldn't it be incorrect to respond to user with any value
when the second read (after mismatch) doesn't find a quorum. So after
sending the deltas to the replicas as a part of the repair (still a part of
/consistent reads/), shouldn't the value be read again to check for the
presence of a quorum after the repair?  

In the example we had, assume the mismatch is detected during a read R1 from
coordinator node C that reaches node1 and node2.
State seen by C after first read R1:  <node1 = val1, node2 = val2, node3 =
val1>

A second read is initiated as a part of repair for consistent read of R1.
This second read observes the values (val1, val2) from (node1, node2) and
sends the corresponding row repair delta to node1. I'm guessing C cannot
respond back to user with val2 until C knows that node1 has actually written
the value val2, thereby meeting the quorum. Is this interpretation correct?






--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583395.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by aaron morton <aa...@thelastpickle.com>.

It's important to point out the difference between Read Repair, in the context of the read_repair_chance setting, and Consistent Reads, in the context of the CL setting.

If RR is active on a request, the request is sent to ALL UP nodes for the key and the RR process is ASYNC to the request. If all of the nodes involved in the request return to the coordinator before rpc_timeout, ReadCallback.maybeResolveForRepair() will put a repair task into the READ_REPAIR stage. This will compare the values, and IF there is a DigestMismatch it will start a Row Repair read that reads the data from all nodes and MAY result in differences being detected and fixed.

All of this is outside of the processing of your read request. It is separate from the stuff below.

Inside the user read request, when ReadCallback.get() is called and CL nodes have responded, the responses are compared. If a DigestMismatch happens then a Row Repair read is started, and the result of this read is returned to the user. This Row Repair read MAY detect differences; if it does, it resolves the super set, sends the delta to the replicas and returns the super set value to the client.
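
A very rough sketch of that consistent-read path, to make the steps concrete. All names here (digest, resolve_superset, consistent_read) are hypothetical and are not the Cassandra API. It also assumes each replica's full row is already in hand, whereas the real path first compares digests and only then issues a second, full data read.

```python
import hashlib


def digest(columns: dict) -> str:
    """Digest of one replica's row; columns maps name -> (value, timestamp)."""
    return hashlib.md5(repr(sorted(columns.items())).encode()).hexdigest()


def resolve_superset(rows: dict) -> dict:
    """Merge rows column by column, keeping the highest-timestamp version."""
    merged = {}
    for columns in rows.values():
        for name, (value, ts) in columns.items():
            if name not in merged or ts > merged[name][1]:
                merged[name] = (value, ts)
    return merged


def consistent_read(rows: dict):
    """rows maps replica -> columns, for the replicas needed by the CL.
    Returns (result for the client, delta to send to each stale replica)."""
    if len({digest(c) for c in rows.values()}) == 1:
        return next(iter(rows.values())), {}  # digests match, nothing to repair
    merged = resolve_superset(rows)
    deltas = {}
    for replica, columns in rows.items():
        delta = {n: v for n, v in merged.items() if columns.get(n) != v}
        if delta:
            deltas[replica] = delta
    return merged, deltas


rows = {"node2": {"col": ("bar", 2)}, "node3": {"col": ("foo", 1)}}
result, deltas = consistent_read(rows)
print(result, deltas)
```

In this model the client gets the merged super set and only node3 is sent a delta. The sketch does not model writes arriving between the first and second read.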

> I'm still missing, how read repairs behave. Just extending your example for
> the following case: 
The example does not use Read Repair, it is handled by Consistent Reads. 

The purpose of RR is to reduce the probability that a read in the future using any of the replicas will result in a Digest Mismatch. "Any of the replicas" means ones that were not necessary for this specific read request. 

> 2. You do a write operation (W1) with quorom of val=2
> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
If the write has not completed then it is not a successful write at the specified CL as it could fail now.

Therefore the R + W > N Strong Consistency guarantee does not apply at this exact point in time. A read to the cluster at this exact point in time using QUORUM may return val2 or val1. Again, the operation W1 has not completed; if read R' starts and completes while W1 is processing, it may or may not return the result of W1.

> In this case, for read R1, the value val2 does not have a quorum. Would read
> R1 return val2 or val4 ? 

If val4 is in the memtable on node before the second read the result will be val4.  
Writes that happen between the initial read and the second read after a Digest Mismatch are included in the read result.

The way I think about consistency is "what value do reads see if writes stop":

* If you have R + W > N, so all writes succeeded at CL QUORUM, all successful reads are guaranteed to see the last write.
* If you are using a low CL and/or had failed writes at QUORUM, then R + W <= N. All successful reads will *eventually* see the last value written, and they are guaranteed to return the value of a previous write or no value. Eventually background Read Repair, Hinted Handoff or nodetool repair will repair the inconsistency.

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 4:39 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>> Thanks Zhang. But, this again seems a little strange thing to do, since
>> one
>> (say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
>> read failure while there are still enough number of replicas (R1 and R3)
>> live to satisfy a read.
> 
> 
> He means in the case where all 3 nodes are live... if a node is down,
> naturally it redirects to the other node and still succeeds because it
> found 2 nodes even with one node down(feel free to test this live though
> !!!!!)
> 
>> 
>> Thanks for the example Dean. This definitely clears things up when you
>> have
>> an overlap between the read and the write, and one comes after the other.
>> I'm still missing, how read repairs behave. Just extending your example
>> for
>> the following case:
>> 
>> 1. node1 = val1 node2 = val1 node3 = val1
>> 
>> 2. You do a write operation (W1) with quorum of val=2
>> node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
>> 
>> 3. Now with a read (R1) from node1 and node2, a read repair will be
>> initiated that needs to write val2 on node 1.
>> node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not
>> complete
>> yet)
>> 
>> 4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
>> now arrives at node 1 but sees a newer value val4.
>> node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete,
>> read
>> repair val2 not complete)
>> 
>> In this case, for read R1, the value val2 does not have a quorum. Would
>> read
>> R1 return val2 or val4 ?
> 
>> 
> At this point as Manu suggests, you need to look at the code but most
> likely what happens is they lock that row, receive the write in memory(ie.
> Not losing it) and return to client, caching it so as soon as read-repair
> is over, it will write that next value.  Ie. Your client would receive
> val2 and val4 would be the value in the database right after you received
> val2.  Ie. When a client interacts with cassandra and you have tons of
> writes to a row, val1, val2, val3, val4 in a short time period, just like
> a normal database, your client may get one of those 4 values depending on
> where the read gets inserted in the order of the writes... same as a normal
> RDBMS.  The only thing you don't have is the atomic nature with other rows.
> 
> NOTICE: they would not have to cache val4 very long, and if a newer write
> came in, they would just replace it with that newer val and cache that one
> instead so it would not be a queue... but this is all just a guess... read the
> code if you really want to know.
> 
>> 
>> 
>> Zhang, Manu wrote
>>> And we don't send read request to all of the three replicas (R1, R2, R3)
>>> if CL=QUORUM; just 2 of them depending on proximity
>> 
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does
>> -ReadRepair-exactly-do-tp7583261p7583372.html
>> Sent from the cassandra-user@incubator.apache.org mailing list archive at
>> Nabble.com.
>

Re: What does ReadRepair exactly do?

Posted by "Hiller, Dean" <De...@nrel.gov>.

>Thanks Zhang. But, this again seems a little strange thing to do, since
>one
>(say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
>read failure while there are still enough number of replicas (R1 and R3)
>live to satisfy a read.


He means in the case where all 3 nodes are live... if a node is down,
naturally it redirects to the other node and still succeeds because it
found 2 nodes even with one node down(feel free to test this live though
!!!!!)

>
>Thanks for the example Dean. This definitely clears things up when you
>have
>an overlap between the read and the write, and one comes after the other.
>I'm still missing, how read repairs behave. Just extending your example
>for
>the following case:
>
>1. node1 = val1 node2 = val1 node3 = val1
>
>2. You do a write operation (W1) with quorum of val=2
>node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)
>
>3. Now with a read (R1) from node1 and node2, a read repair will be
>initiated that needs to write val2 on node 1.
>node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not
>complete
>yet)
>
>4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
>now arrives at node 1 but sees a newer value val4.
>node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete,
>read
>repair val2 not complete)
>
>In this case, for read R1, the value val2 does not have a quorum. Would
>read
>R1 return val2 or val4 ?

> 
At this point as Manu suggests, you need to look at the code but most
likely what happens is they lock that row, receive the write in memory(ie.
Not losing it) and return to client, caching it so as soon as read-repair
is over, it will write that next value.  Ie. Your client would receive
val2 and val4 would be the value in the database right after you received
val2.  Ie. When a client interacts with cassandra and you have tons of
writes to a row, val1, val2, val3, val4 in a short time period, just like
a normal database, your client may get one of those 4 values depending on
where the read gets inserted in the order of the writes... same as a normal
RDBMS.  The only thing you don't have is the atomic nature with other rows.

NOTICE: they would not have to cache val4 very long, and if a newer write
came in, they would just replace it with that newer val and cache that one
instead so it would not be a queue... but this is all just a guess... read the
code if you really want to know.

>
>
>Zhang, Manu wrote
>> And we don't send read request to all of the three replicas (R1, R2, R3)
>> if CL=QUORUM; just 2 of them depending on proximity
>
>
>
>
>

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

Hiller, Dean wrote
> I guess one more thing: I completely ignored your second write, mainly
> because I assume it comes after we have already read. So let's say your
> current state is
> 
> node1 = val1 node2 = val1 node3 = val1
> 
> You do a write quorum of val=2 which is IN the middle!!!
> 
> node1 = val1 node2 = val2 node3 = val1  (NOTICE the write is not complete
> yet)
> 
> If you read from node1 and node3, you get val1.  If you read from node1
> and node2, you get val2 as a read repair will happen.
> 
> Ie. You always get the older value or newer value.
> 
> If you have two writes come in like so
> 
> node1 = val1 node2 = val2 and node3= val3
> 
> Well, I think you can figure it out when you do a read ;).  If your read
> quorum reads from node1 and node3 , you get val3, etc. etc.
> 
> This is basically how it works….If your scenario is a web page, a user
> simply hits the refresh button and sees the values changing.
> 
> Later,
> Dean

Thanks for the example Dean. This definitely clears things up when you have
an overlap between the read and the write, and one comes after the other.
I'm still missing how read repairs behave. Just extending your example for
the following case: 

1. node1 = val1 node2 = val1 node3 = val1

2. You do a write operation (W1) with quorum of val=2
node1 = val1 node2 = val2 node3 = val1  (write val2 is not complete yet)

3. Now with a read (R1) from node1 and node2, a read repair will be
initiated that needs to write val2 on node 1.  
node1 = val1; node2 = val2; node3 = val1  (read repair val2 is not complete
yet)

4. Say, in the meanwhile node 1 receives a write val 4; Read repair for R1
now arrives at node 1 but sees a newer value val4.
node1 = val4; node2 = val2; node3 = val1  (write val4 is not complete, read
repair val2 not complete)

In this case, for read R1, the value val2 does not have a quorum. Would read
R1 return val2 or val4 ? 


Zhang, Manu wrote
> And we don't send read request to all of the three replicas (R1, R2, R3)
> if CL=QUORUM; just 2 of them depending on proximity

Thanks Zhang. But this again seems a slightly strange thing to do, since one
(say R2) of the 2 close replicas (say R1,R2) might be down, resulting in a
read failure while there are still enough replicas (R1 and R3)
live to satisfy a read. 



--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583372.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by "Hiller, Dean" <De...@nrel.gov>.

I guess one more thing: I completely ignored your second write, mainly because I assume it comes after we have already read. So let's say your current state is

node1 = val1 node2 = val1 node3 = val1

You do a write quorum of val=2 which is IN the middle!!!

node1 = val1 node2 = val2 node3 = val1  (NOTICE the write is not complete yet)

If you read from node1 and node3, you get val1.  If you read from node1 and node2, you get val2 as a read repair will happen.

Ie. You always get the older value or newer value.

If you have two writes come in like so

node1 = val1 node2 = val2 and node3= val3

Well, I think you can figure it out when you do a read ;).  If your read quorum reads from node1 and node3 , you get val3, etc. etc.

This is basically how it works….If your scenario is a web page, a user simply hits the refresh button and sees the values changing.
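
A small Python sketch of the example above may help. It assumes, as a simplification, that each node holds one (value, timestamp) pair and that a quorum read simply returns the newest value among the two nodes it contacts; quorum_read and possible_results are names invented for this illustration, not Cassandra functions.

```python
from itertools import combinations

def quorum_read(replicas, contacted):
    """Return the value with the highest timestamp among the contacted nodes."""
    value, _timestamp = max((replicas[node] for node in contacted),
                            key=lambda col: col[1])
    return value

def possible_results(replicas, quorum_size=2):
    """Map every possible set of contacted nodes to the value it would return."""
    return {pair: quorum_read(replicas, pair)
            for pair in combinations(sorted(replicas), quorum_size)}

# State in the middle of the quorum write of val2 (only node2 has it so far).
mid_write = {"node1": ("val1", 1), "node2": ("val2", 2), "node3": ("val1", 1)}
results = possible_results(mid_write)
```

Here node1 plus node3 yields val1, while any pair that includes node2 yields val2, so the client sees either the older or the newer value depending on which replicas answer.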

Later,
Dean

From: Manu Zhang <ow...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday, October 24, 2012 8:26 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: What does ReadRepair exactly do?

And we don't send the read request to all of the three replicas (R1, R2, R3) if CL=QUORUM; just 2 of them depending on proximity

On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean <De...@nrel.gov> wrote:
The user will meet the required consistency unless you encounter some kind
of bug in cassandra.  You will either get the older value or the newer
value. If you read quorum, and maybe a write CL=1 just happened, you may
get the older or new value depending on if the node that received the
write was involved.  If you read quorum and you wrote CL=QUORUM, then you
may get the newer value or the older value depending on who gets there
first so to speak.

In your scenario, if the read repair read from R2 just before the write is
applied, you get the old value.  If it read from R2 just after the write
was applied, it gets the new value.  BOTH of these met the consistency
constraint.  A better example to clear this up may be the following...  If
you read a value at CL=QUORUM, and you have a write 20ms later, you get
the old value, right?  And it met the consistency level, right?  NOW, what
about if the write is 1ms later?  What if the write is .00001ms later?
It still met the consistency level, right?  If it is .00001ms before, you
get the new value as it repairs first with the new node.

It is just when programming, your read may get the newer value or older
value and generally if you write the code in a way that works, this
concept works out great in most cases(in some cases, you need to think a
bit differently and solve it other ways).

I hope that clears it up

Later,
Dean

On 10/24/12 8:02 AM, "shankarpnsn" <sh...@gmail.com> wrote:

>Hiller, Dean wrote
>> in general it is okay to get the older or newer value.  If you are
>>reading
>> 2 rows however instead of one, that may change.
>
>This is certainly interesting, as it could mean that the user could see a
>value that never met the required consistency. For instance with 3
>replicas
><R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
>(becomes the coordinator) - notices a conflict with R2 (assume R1 has a
>more
>recent value) and initiates a read repair with its value. Meanwhile R2 and
>R3 have seen two different writes with newer values than what was computed
>by the read repair. If R1 were to respond back to the user with the value
>that was computed at the time of read repair, wouldn't it be a value that
>never met the consistency constraint? I was thinking if this should
>trigger
>another round of repair that tries to reach the consistency constraint
>with
>a newer value or time-out, which is the expected case when you don't meet
>the required consistency. Please let me know if I'm missing something
>here.
>
>
>
>--
>View this message in context:
>http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does
>-ReadRepair-exactly-do-tp7583261p7583366.html
>Sent from the cassandra-user@incubator.apache.org mailing list archive at
>Nabble.com.

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

And we don't send the read request to all of the three replicas (R1, R2, R3) if
CL=QUORUM; just 2 of them depending on proximity

On Wed, Oct 24, 2012 at 10:20 PM, Hiller, Dean <De...@nrel.gov> wrote:

> The user will meet the required consistency unless you encounter some kind
> of bug in cassandra.  You will either get the older value or the newer
> value. If you read quorum, and maybe a write CL=1 just happened, you may
> get the older or new value depending on if the node that received the
> write was involved.  If you read quorum and you wrote CL=QUORUM, then you
> may get the newer value or the older value depending on who gets there
> first so to speak.
>
> In your scenario, if the read repair read from R2 just before the write is
> applied, you get the old value.  If it read from R2 just after the write
> was applied, it gets the new value.  BOTH of these met the consistency
> constraint.  A better example to clear this up may be the following...  If
> you read a value at CL=QUORUM, and you have a write 20ms later, you get
> the old value, right?  And it met the consistency level, right?  NOW, what
> about if the write is 1ms later?  What if the write is .00001ms later?
> It still met the consistency level, right?  If it is .00001ms before, you
> get the new value as it repairs first with the new node.
>
> It is just when programming, your read may get the newer value or older
> value and generally if you write the code in a way that works, this
> concept works out great in most cases(in some cases, you need to think a
> bit differently and solve it other ways).
>
> I hope that clears it up
>
> Later,
> Dean
>
> On 10/24/12 8:02 AM, "shankarpnsn" <sh...@gmail.com> wrote:
>
> >Hiller, Dean wrote
> >> in general it is okay to get the older or newer value.  If you are
> >>reading
> >> 2 rows however instead of one, that may change.
> >
> >This is certainly interesting, as it could mean that the user could see a
> >value that never met the required consistency. For instance with 3
> >replicas
> ><R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
> >(becomes the coordinator) - notices a conflict with R2 (assume R1 has a
> >more
> >recent value) and initiates a read repair with its value. Meanwhile R2 and
> >R3 have seen two different writes with newer values than what was computed
> >by the read repair. If R1 were to respond back to the user with the value
> >that was computed at the time of read repair, wouldn't it be a value that
> >never met the consistency constraint? I was thinking if this should
> >trigger
> >another round of repair that tries to reach the consistency constraint
> >with
> >a newer value or time-out, which is the expected case when you don't meet
> >the required consistency. Please let me know if I'm missing something
> >here.
> >
> >
> >
>
>

Re: What does ReadRepair exactly do?

Posted by "Hiller, Dean" <De...@nrel.gov>.

The user will meet the required consistency unless you encounter some kind
of bug in cassandra.  You will either get the older value or the newer
value. If you read quorum, and maybe a write CL=1 just happened, you may
get the older or new value depending on if the node that received the
write was involved.  If you read quorum and you wrote CL=QUORUM, then you
may get the newer value or the older value depending on who gets there
first so to speak. 

In your scenario, if the read repair read from R2 just before the write is
applied, you get the old value.  If it read from R2 just after the write
was applied, it gets the new value.  BOTH of these met the consistency
constraint.  A better example to clear this up may be the following...  If
you read a value at CL=QUORUM, and you have a write 20ms later, you get
the old value, right?  And it met the consistency level, right?  NOW, what
about if the write is 1ms later?  What if the write is .00001ms later?
It still met the consistency level, right?  If it is .00001ms before, you
get the new value as it repairs first with the new node.

It is just when programming, your read may get the newer value or older
value and generally if you write the code in a way that works, this
concept works out great in most cases(in some cases, you need to think a
bit differently and solve it other ways).

I hope that clears it up

Later,
Dean

On 10/24/12 8:02 AM, "shankarpnsn" <sh...@gmail.com> wrote:

>Hiller, Dean wrote
>> in general it is okay to get the older or newer value.  If you are
>>reading
>> 2 rows however instead of one, that may change.
>
>This is certainly interesting, as it could mean that the user could see a
>value that never met the required consistency. For instance with 3
>replicas
><R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
>(becomes the coordinator) - notices a conflict with R2 (assume R1 has a
>more
>recent value) and initiates a read repair with its value. Meanwhile R2 and
>R3 have seen two different writes with newer values than what was computed
>by the read repair. If R1 were to respond back to the user with the value
>that was computed at the time of read repair, wouldn't it be a value that
>never met the consistency constraint? I was thinking if this should
>trigger
>another round of repair that tries to reach the consistency constraint
>with
>a newer value or time-out, which is the expected case when you don't meet
>the required consistency. Please let me know if I'm missing something
>here. 
>
>
>

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

Hiller, Dean wrote
> in general it is okay to get the older or newer value.  If you are reading
> 2 rows however instead of one, that may change.

This is certainly interesting, as it could mean that the user could see a
value that never met the required consistency. For instance with 3 replicas
<R1,R2,R3> and a quorum consistency, assume that R1 is initiating a read
(becomes the coordinator) - notices a conflict with R2 (assume R1 has a more
recent value) and initiates a read repair with its value. Meanwhile R2 and
R3 have seen two different writes with newer values than what was computed
by the read repair. If R1 were to respond back to the user with the value
that was computed at the time of read repair, wouldn't it be a value that
never met the consistency constraint? I was thinking if this should trigger
another round of repair that tries to reach the consistency constraint with
a newer value or time-out, which is the expected case when you don't meet
the required consistency. Please let me know if I'm missing something here. 



--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583366.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by "Hiller, Dean" <De...@nrel.gov>.

Keep in mind, returning the older version is usually fine.  Just imagine
if your user clicked write 1 ms before, then the new version might be
returned.  If he gets the older version and refreshes the page, he gets
the newer version.  Same with an automated program as well... in general it
is okay to get the older or newer value.  If you are reading 2 rows
however instead of one, that may change.

Dean

On 10/23/12 7:04 PM, "shankarpnsn" <sh...@gmail.com> wrote:

>manuzhang wrote
>> why repair again? We block until the consistency constraint is met. Then
>> the latest version is returned and repair is done asynchronously if any
>> mismatch. We may retry read if fewer columns than required are returned.
>
>Just to make sure I understand you correct, considering the case when a
>read
>repair is in flight and a subsequent write affects one or more of the
>replicas that was scheduled to received the repair mutations. In this
>case,
>are you saying that we return the older version to the user rather than
>the
>latest version that was effected by the write ?
>
>
>
>--
>View this message in context:
>http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does
>-ReadRepair-exactly-do-tp7583261p7583355.html
>Sent from the cassandra-user@incubator.apache.org mailing list archive at
>Nabble.com.

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

manuzhang wrote
> why repair again? We block until the consistency constraint is met. Then
> the latest version is returned and repair is done asynchronously if any
> mismatch. We may retry read if fewer columns than required are returned.

Just to make sure I understand you correctly: consider the case when a read
repair is in flight and a subsequent write affects one or more of the
replicas that were scheduled to receive the repair mutations. In this case,
are you saying that we return the older version to the user rather than the
latest version that was effected by the write?



--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583355.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

Why repair again? We block until the consistency constraint is met. Then
the latest version is returned and repair is done asynchronously if there is
any mismatch. We may retry the read if fewer columns than required are returned.

On Wed, Oct 24, 2012 at 6:10 AM, shankarpnsn <sh...@gmail.com> wrote:

> Hello,
>
> This conversation precisely targets a question that I had been having for a
> while - would be grateful if you someone cloud clarify it a little further:
>
> Considering the case of a "repair" created due to a consistency constraint
> (first case in the discussion above), would the following interpretation be
> correct ?
>
> 1. A digest mismatch exception is raised even if one among the many
> responses (even if consistency is met on an out-of-date value, say by
> virtue
> of timestamp).
> 2. A read is initiated by the callback to fetch data from all replicas
> 3. Resolve() is invoked to find the deltas for each replica that was out of
> date.
> 4. ReadRepair is scheduled to the above replicas.
> 5. Perform a normal read and check if this meets the consistency
> constraints. Mismatches would trigger a repair again.
>
> Assuming the above is true, would the mutations in step 4 and the read in
> step 5 happen in parallel ? In other words, would the time taken by the
> read
> correction be the round trip between the coordinator and its farthest
> replica that meets the consistency constraint.
>
> Thanks,
> Shankar
>
>
>
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583352.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: What does ReadRepair exactly do?

Posted by shankarpnsn <sh...@gmail.com>.

Hello, 

This conversation precisely targets a question that I had been having for a
while - would be grateful if someone could clarify it a little further: 

Considering the case of a "repair" created due to a consistency constraint
(first case in the discussion above), would the following interpretation be
correct ?

1. A digest mismatch exception is raised even if only one among the many
responses differs (even if consistency is met on an out-of-date value, say by
virtue of timestamp).
2. A read is initiated by the callback to fetch data from all replicas
3. Resolve() is invoked to find the deltas for each replica that was out of
date. 
4. ReadRepair is scheduled to the above replicas. 
5. Perform a normal read and check if this meets the consistency
constraints. Mismatches would trigger a repair again. 

Assuming the above is true, would the mutations in step 4 and the read in
step 5 happen in parallel ? In other words, would the time taken by the read
correction be the round trip between the coordinator and its farthest
replica that meets the consistency constraint.  

Thanks,
Shankar



--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-does-ReadRepair-exactly-do-tp7583261p7583352.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: What does ReadRepair exactly do?

Posted by Shankaranarayanan P N <sh...@gmail.com>.

Hello,

This conversation precisely targets a question that I had been having for a
while - would be grateful if someone could clarify it a little further:

Considering the case of a "repair" created due to a consistency constraint
(first case in the discussion above), would the following interpretation be
correct ?

1. A digest mismatch exception is raised even if only one among the many
responses differs (even if consistency is met on an out-of-date value, say by
virtue of timestamp).
2. A read is initiated by the callback to fetch data from all replicas
3. Resolve() is invoked to find the deltas for each replica that was out of
date.
4. ReadRepair is scheduled to the above replicas.
5. Perform a normal read and check if this meets the consistency
constraints. Mismatches would trigger a repair again.

Assuming the above is true, would the mutations in step 4 and the read in
step 5 happen in parallel ? In other words, would the time taken by the
read correction be the round trip between the coordinator and its farthest
replica that meets the consistency constraint.

Thanks,
Shankar


On Tue, Oct 23, 2012 at 3:17 AM, aaron morton <aa...@thelastpickle.com> wrote:

> Yes, all this starts because of the call to filter.collateColumns()…
>
> The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer,
> the methods to add columns on that interface pass through to an
> implementation of ISortedColumns.
>
> The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will
> call reconcile() on the IColumn if they need to.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/10/2012, at 4:45 AM, Manu Zhang <ow...@gmail.com> wrote:
>
> Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE)
> and then MergeIterator.get(toCollate, fcomp, reducer) but I don't know what
> happens hereafter? How is reconcile exactly been called?
>
> On Mon, Oct 22, 2012 at 6:49 AM, aaron morton <aa...@thelastpickle.com>wrote:
>
>> There are two processes in cassandra that trigger Read Repair like
>> behaviour.
>>
>> During a DigestMismatchException is raised if the responses from the
>> replicas do not match. In this case another read is run that involves
>> reading all the data. This is the CL level agreement kicking in.
>>
>> The other "Read Repair" is the one controlled by the
>> "read_repair_chance". When RR is active on a request ALL up replicas are
>> involved in the read. When RR is not active only CL replicas are involved.
>> When test for CL agreement occurs synchronously to the request; the RR
>> check waits asynchronously to the request for all nodes in the request to
>> return. It then checks for consistency and repairs differences.
>>
>> From looking at the source code, I do not understand how this set is
>> built and I do not understand how the reconciliation is executed.
>>
>> When a DigestMismatch is detected a read is run using RepairCallback. The
>> callback will call the RowRepairResolver.resolve() when enough responses
>> have been collected.
>>
>> resolveSuperset() picks one response to the baseline, and then calls
>> delete() to apply row level deletes from the other responses
>> (ColumnFamily's). It collects the other CF's into an iterator with a filter
>> that returns all columns. The columns are then applied to the baseline CF
>> which may result in reconcile() being called.
>>
>> reconcile() is used when a AbstractColumnContainer has two versions of a
>> column and it wants to only have one.
>>
>> RowRepairResolve.scheduleRepairs() works out the delta for each node by
>> calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
>>
>>
>> Hope that helps.
>>
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 19/10/2012, at 6:33 AM, Markus Klems <ma...@gmail.com> wrote:
>>
>> Hi guys,
>>
>> I am looking through the Cassandra source code in the github trunk to
>> better understand how Cassandra's fault-tolerance mechanisms work. Most
>> things make sense. I am also aware of the wiki and DataStax documentation.
>> However, I do not understand what read repair does in detail. The method
>> RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to
>> do the trick of merging conflicting versions of column family replicas and
>> builds the set of columns that need to be "repaired". From looking at the
>> source code, I do not understand how this set is built and I do not
>> understand how the reconciliation is executed. ReadRepair does not seem to
>> trigger a Column.reconcile() to reconcile conflicting column versions on
>> different servers. Does it?
>>
>> If this is not what read repair does, then: What kind of inconsistencies
>> are resolved by read repair? And: How are the inconsistencies resolved?
>>
>> Could someone give me a hint?
>>
>> Thanks so much,
>>
>> -Markus
>>
>>
>>
>
>

Re: What does ReadRepair exactly do?

Posted by aaron morton <aa...@thelastpickle.com>.

Yes, all this starts because of the call to filter.collateColumns()…

The ColumnFamily is an implementation of o.a.c.db.AbstractColumnContainer; the methods to add columns on that interface pass through to an implementation of ISortedColumns. 

The implementations of ISortedColumns, e.g. ArrayBackedSortedColumns, will call reconcile() on the IColumn if they need to. 
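
To picture what "call reconcile() if they need to" means, here is a small Python sketch of a column container that reconciles on add. It is only a model of the idea: Column, reconcile and SortedColumns are illustrative names, not the Java classes, and breaking timestamp ties by comparing values is an assumption of this sketch.

```python
from collections import namedtuple

# Illustrative stand-in for a column: name, value and write timestamp.
Column = namedtuple("Column", "name value timestamp")

def reconcile(a, b):
    """Pick one of two versions of the same column; the newer timestamp wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Assumption for this sketch: on equal timestamps, the larger value wins.
    return a if a.value >= b.value else b

class SortedColumns:
    """Toy container that holds at most one version of each column name."""

    def __init__(self):
        self._columns = {}

    def add_column(self, column):
        existing = self._columns.get(column.name)
        if existing is not None:
            # Only here, when two versions collide, is reconcile() needed.
            column = reconcile(existing, column)
        self._columns[column.name] = column

    def columns(self):
        return [self._columns[name] for name in sorted(self._columns)]
```

Adding the columns from several replica responses to one such container, in any order, leaves the newest version of each column, which is the merged "super set" row.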

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 4:45 AM, Manu Zhang <ow...@gmail.com> wrote:

> Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what happens after that. How exactly does reconcile get called?
> 
> On Mon, Oct 22, 2012 at 6:49 AM, aaron morton <aa...@thelastpickle.com> wrote:
> There are two processes in Cassandra that trigger Read Repair-like behaviour.
> 
> During a read, a DigestMismatchException is raised if the responses from the replicas do not match. In this case another read is run that involves reading the full data rather than digests. This is the CL level agreement kicking in.
> 
> The other "Read Repair" is the one controlled by "read_repair_chance". When RR is active on a request, ALL up replicas are involved in the read. When RR is not active, only CL replicas are involved. The test for CL agreement occurs synchronously to the request; the RR check waits asynchronously to the request for all nodes in the request to return. It then checks for consistency and repairs differences.
> 
>> From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed.
> When a DigestMismatch is detected, a read is run using RepairCallback. The callback will call RowRepairResolver.resolve() when enough responses have been collected.
> 
> resolveSuperset() picks one response as the baseline, and then calls delete() to apply row level deletes from the other responses (ColumnFamily's). It collects the other CFs into an iterator with a filter that returns all columns. The columns are then applied to the baseline CF, which may result in reconcile() being called.
> 
> reconcile() is used when an AbstractColumnContainer has two versions of a column and wants to keep only one.
> 
> RowRepairResolver.scheduleRepairs() works out the delta for each node by calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
> 
> 
> Hope that helps. 
> 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19/10/2012, at 6:33 AM, Markus Klems <ma...@gmail.com> wrote:
> 
>> Hi guys,
>> 
>> I am looking through the Cassandra source code in the github trunk to better understand how Cassandra's fault-tolerance mechanisms work. Most things make sense. I am also aware of the wiki and DataStax documentation. However, I do not understand what read repair does in detail. The method RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to do the trick of merging conflicting versions of column family replicas and builds the set of columns that need to be "repaired". From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. ReadRepair does not seem to trigger a Column.reconcile() to reconcile conflicting column versions on different servers. Does it?
>> 
>> If this is not what read repair does, then: What kind of inconsistencies are resolved by read repair? And: How are the inconsistencies resolved?
>> 
>> Could someone give me a hint?
>> 
>> Thanks so much,
>> 
>> -Markus
> 
>

Re: What does ReadRepair exactly do?

Posted by Manu Zhang <ow...@gmail.com>.

Is it through filter.collateColumns(resolved, iters, Integer.MIN_VALUE) and
then MergeIterator.get(toCollate, fcomp, reducer)? I don't know what
happens after that. How exactly is reconcile() called?

On Mon, Oct 22, 2012 at 6:49 AM, aaron morton <aa...@thelastpickle.com> wrote:

> There are two processes in Cassandra that trigger Read Repair-like
> behaviour.
>
> During a read, a DigestMismatchException is raised if the responses from
> the replicas do not match. In this case another read is run that involves
> reading the full data rather than digests. This is the CL level agreement
> kicking in.
>
> The other "Read Repair" is the one controlled by "read_repair_chance".
> When RR is active on a request, ALL up replicas are involved in the read.
> When RR is not active, only CL replicas are involved. The test for CL
> agreement occurs synchronously to the request; the RR check
> waits asynchronously to the request for all nodes in the request to return.
> It then checks for consistency and repairs differences.
>
>> From looking at the source code, I do not understand how this set is built
>> and I do not understand how the reconciliation is executed.
>
> When a DigestMismatch is detected, a read is run using RepairCallback. The
> callback will call RowRepairResolver.resolve() when enough responses
> have been collected.
>
> resolveSuperset() picks one response as the baseline, and then calls
> delete() to apply row level deletes from the other responses
> (ColumnFamily's). It collects the other CFs into an iterator with a filter
> that returns all columns. The columns are then applied to the baseline CF,
> which may result in reconcile() being called.
>
> reconcile() is used when an AbstractColumnContainer has two versions of a
> column and wants to keep only one.
>
> RowRepairResolver.scheduleRepairs() works out the delta for each node by
> calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
>
>
> Hope that helps.
>
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/10/2012, at 6:33 AM, Markus Klems <ma...@gmail.com> wrote:
>
> Hi guys,
>
> I am looking through the Cassandra source code in the github trunk to
> better understand how Cassandra's fault-tolerance mechanisms work. Most
> things make sense. I am also aware of the wiki and DataStax documentation.
> However, I do not understand what read repair does in detail. The method
> RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to
> do the trick of merging conflicting versions of column family replicas and
> builds the set of columns that need to be "repaired". From looking at the
> source code, I do not understand how this set is built and I do not
> understand how the reconciliation is executed. ReadRepair does not seem to
> trigger a Column.reconcile() to reconcile conflicting column versions on
> different servers. Does it?
>
> If this is not what read repair does, then: What kind of inconsistencies
> are resolved by read repair? And: How are the inconsistencies resolved?
>
> Could someone give me a hint?
>
> Thanks so much,
>
> -Markus
>
>
>

Re: What does ReadRepair exactly do?

Posted by aaron morton <aa...@thelastpickle.com>.

There are two processes in Cassandra that trigger Read Repair-like behaviour.

During a read, a DigestMismatchException is raised if the responses from the replicas do not match. In this case another read is run that involves reading the full data rather than digests. This is the CL level agreement kicking in.

The other "Read Repair" is the one controlled by "read_repair_chance". When RR is active on a request, ALL up replicas are involved in the read. When RR is not active, only CL replicas are involved. The test for CL agreement occurs synchronously to the request; the RR check waits asynchronously to the request for all nodes in the request to return. It then checks for consistency and repairs differences.
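
The two mechanisms above can be sketched roughly as follows. This is an illustration under stated assumptions, not the real read path: ReadPathSketch, replicasToContact(), digest() and digestsMatch() are hypothetical names, replicas are plain strings, a row is a plain string, and the random roll is passed in so the behaviour is deterministic.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of replica selection and digest comparison. */
final class ReadPathSketch {

    /**
     * With read repair active (roll < readRepairChance) every live replica is read;
     * otherwise only as many replicas as the consistency level blocks for.
     */
    static List<String> replicasToContact(List<String> liveReplicas, int blockFor,
                                          double readRepairChance, double roll) {
        boolean readRepair = roll < readRepairChance;
        int n = readRepair ? liveReplicas.size() : Math.min(blockFor, liveReplicas.size());
        return new ArrayList<>(liveReplicas.subList(0, n));
    }

    /** Digest of a row's content; MD5 is used here purely for illustration. */
    static byte[] digest(String rowData) {
        try {
            return MessageDigest.getInstance("MD5")
                                .digest(rowData.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is available on every Java platform
        }
    }

    /** False corresponds to the digest mismatch that triggers a second, full-data read. */
    static boolean digestsMatch(String dataResponse, List<byte[]> digestResponses) {
        byte[] expected = digest(dataResponse);
        for (byte[] d : digestResponses)
            if (!Arrays.equals(expected, d))
                return false;
        return true;
    }

    public static void main(String[] args) {
        List<String> live = Arrays.asList("n1", "n2", "n3");
        System.out.println(replicasToContact(live, 2, 0.1, 0.5));  // RR inactive
        System.out.println(replicasToContact(live, 2, 0.1, 0.05)); // RR active
        System.out.println(digestsMatch("row-v2", Arrays.asList(digest("row-v1"))));
    }
}
```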

> From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed.
When a DigestMismatch is detected, a read is run using RepairCallback. The callback will call RowRepairResolver.resolve() when enough responses have been collected.

resolveSuperset() picks one response as the baseline, and then calls delete() to apply row level deletes from the other responses (ColumnFamily's). It collects the other CFs into an iterator with a filter that returns all columns. The columns are then applied to the baseline CF, which may result in reconcile() being called.

reconcile() is used when an AbstractColumnContainer has two versions of a column and wants to keep only one.

RowRepairResolver.scheduleRepairs() works out the delta for each node by calling ColumnFamily.diff(). The delta is then sent to the appropriate node.
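
The resolve-then-diff flow can be sketched as below. This is a simplified illustration, not the actual RowRepairResolver code: SupersetSketch and Cell are hypothetical, each replica response is modelled as a map from column name to Cell, and row level deletes and tombstones are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: merge replica versions, then compute a repair delta per replica. */
final class SupersetSketch {

    /** A column value with its write timestamp. */
    static final class Cell {
        final String value;
        final long ts;

        Cell(String value, long ts) {
            this.value = value;
            this.ts = ts;
        }
    }

    /** Merge all versions, keeping the newest cell for each column name. */
    static Map<String, Cell> resolveSuperset(List<Map<String, Cell>> versions) {
        Map<String, Cell> resolved = new TreeMap<>();
        for (Map<String, Cell> version : versions)
            for (Map.Entry<String, Cell> e : version.entrySet())
                resolved.merge(e.getKey(), e.getValue(), (a, b) -> b.ts > a.ts ? b : a);
        return resolved;
    }

    /** Cells of the resolved row that the replica is missing or holds in an older version. */
    static Map<String, Cell> diff(Map<String, Cell> replica, Map<String, Cell> resolved) {
        Map<String, Cell> delta = new TreeMap<>();
        for (Map.Entry<String, Cell> e : resolved.entrySet()) {
            Cell mine = replica.get(e.getKey());
            if (mine == null || mine.ts < e.getValue().ts)
                delta.put(e.getKey(), e.getValue());
        }
        return delta;
    }

    /** One delta per replica, in input order; an empty delta means nothing needs to be sent. */
    static List<Map<String, Cell>> scheduleRepairs(List<Map<String, Cell>> versions) {
        Map<String, Cell> resolved = resolveSuperset(versions);
        List<Map<String, Cell>> deltas = new ArrayList<>();
        for (Map<String, Cell> version : versions)
            deltas.add(diff(version, resolved));
        return deltas;
    }

    public static void main(String[] args) {
        Map<String, Cell> r1 = new HashMap<>();
        r1.put("a", new Cell("v1", 1L));
        r1.put("b", new Cell("x", 5L));
        Map<String, Cell> r2 = new HashMap<>();
        r2.put("a", new Cell("v2", 2L));
        List<Map<String, Cell>> deltas = scheduleRepairs(Arrays.asList(r1, r2));
        System.out.println(deltas.get(0).keySet() + " " + deltas.get(1).keySet());
    }
}
```

In this model the first replica only receives the newer "a" and the second only the missing "b", so each node gets just the columns it lacks rather than the whole resolved row.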

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 19/10/2012, at 6:33 AM, Markus Klems <ma...@gmail.com> wrote:

> Hi guys,
> 
> I am looking through the Cassandra source code in the github trunk to better understand how Cassandra's fault-tolerance mechanisms work. Most things make sense. I am also aware of the wiki and DataStax documentation. However, I do not understand what read repair does in detail. The method RowRepairResolver.resolveSuperset(Iterable<ColumnFamily> versions) seems to do the trick of merging conflicting versions of column family replicas and builds the set of columns that need to be "repaired". From looking at the source code, I do not understand how this set is built and I do not understand how the reconciliation is executed. ReadRepair does not seem to trigger a Column.reconcile() to reconcile conflicting column versions on different servers. Does it?
> 
> If this is not what read repair does, then: What kind of inconsistencies are resolved by read repair? And: How are the inconsistencies resolved?
> 
> Could someone give me a hint?
> 
> Thanks so much,
> 
> -Markus