Posted to user@cassandra.apache.org by Roland Hänel <ro...@haenel.me> on 2010/04/28 09:02:32 UTC

Detailed behavior of insert() operation?

Does Cassandra make any guarantees on the outcome of a scenario like this:

Two clients insert the same key/column with different values at the same
time:

   client A does insert(keyspace, key_1, column_name_1, value_A, timestamp_1, consistency_level.QUORUM)
   client B does insert(keyspace, key_1, column_name_1, value_B, timestamp_1, consistency_level.QUORUM)

After that, both clients read their value:

   client A does get(keyspace, key_1, column_name_1, consistency_level.QUORUM)
   client B does get(keyspace, key_1, column_name_1, consistency_level.QUORUM)

It is obvious that since the insert happens 'at the same time', i.e. with
the same timestamp, we cannot say which value (value_A or value_B) gets
written to the row. However, do we have a guarantee that either value_A or
value_B is written, and that both read operations will return the same
result?

Or might there be a (short-lived?) race condition where both get()
operations return different results (typically value_A for client A,
value_B for client B)?

-Roland

Re: Detailed behavior of insert() operation?

Posted by Sylvain Lebresne <sy...@yakaz.com>.

> One last question (sorry to bother you): isn't the behavior of read repair
> strictly deterministic in this case? You say both read requests could try to
> read repair the result (each time in the opposite direction). Inside the
> read repair algorithm, when we have exactly the same timestamps, what value
> is elected for repair? The first one that the node got in the read request?
> If we make that deterministic, we could avoid this scenario, right?

This is deterministic but not centralized. You may have node A with value va
and node B with value vb. Then you read simultaneously on nodes A and B.
A gets vb from B with the same timestamp and decides deterministically to
take the new value. B does the same in the meantime. The point is, each node
deterministically does the same thing. You could break that if read repair
used some total ordering on the nodes to decide what to keep on ties (in my
example, A and B would both decide to keep A's version, for instance). But
there is no easy way at all to do such things in the current implementation.
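
To make the swap concrete, here is a minimal Python sketch of the scenario
above. It is not Cassandra code: the Node class, simultaneous_repair() and
ordered_repair() are hypothetical names, and the "adopt the remote value on a
tie" rule is an assumption taken from the description, not from the source
tree.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    value: str
    timestamp: int


def simultaneous_repair(a: Node, b: Node) -> None:
    """Both nodes snapshot each other at the same instant, then each applies
    the same local rule: on a timestamp tie, adopt the remote value."""
    seen_by_a = (b.value, b.timestamp)  # snapshots taken before any write
    seen_by_b = (a.value, a.timestamp)
    for node, (remote_value, remote_ts) in ((a, seen_by_a), (b, seen_by_b)):
        if remote_ts >= node.timestamp and remote_value != node.value:
            node.value = remote_value
            node.timestamp = remote_ts


def ordered_repair(a: Node, b: Node) -> None:
    """Hypothetical alternative: on a tie, a total ordering on node names
    decides, so both nodes keep the version of the lowest-named node."""
    if a.timestamp != b.timestamp:
        winner = max((a, b), key=lambda n: n.timestamp)
    else:
        winner = min((a, b), key=lambda n: n.name)
    a.value = b.value = winner.value
    a.timestamp = b.timestamp = winner.timestamp


node_a = Node("A", "va", 1)
node_b = Node("B", "vb", 1)

simultaneous_repair(node_a, node_b)
swapped = (node_a.value, node_b.value)    # the values have traded places

ordered_repair(node_a, node_b)
converged = (node_a.value, node_b.value)  # both nodes now hold A's version
```

After simultaneous_repair() the two nodes still disagree, only the other way
round. With the node ordering, one pass is enough for them to agree.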

Re: Detailed behavior of insert() operation?

Posted by Roland Hänel <ro...@haenel.me>.

Here is the ticket: https://issues.apache.org/jira/browse/CASSANDRA-1039

Thanks, Roland

Re: Detailed behavior of insert() operation?

Posted by Jonathan Ellis <jb...@gmail.com>.

2010/4/29 Roland Hänel <ro...@haenel.me>:
> Imagine the following rule: if we are in doubt whether to repair a column
> with timestamp T (because two values X and Y are present within the cluster,
> both at timestamp T), then we always repair towards X if md5(X)<md5(Y). In
> this case, even after an inconsistency on the first insert, this would be
> cleared by any node that triggers a repair afterwards.
>
> And then you're done: a Cassandra client can create a unique transaction ID
> by inserting a column with the ID this client wants to grab as key, and some
> random stuff as value, then the client reads the just-inserted column, and
> if the ID and the same random stuff is there - voila, the ID is unique for
> this client.

That sounds like an excellent idea to me.  Can you create a ticket for
that at https://issues.apache.org/jira/browse/CASSANDRA ?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Detailed behavior of insert() operation?

Posted by Roland Hänel <ro...@haenel.me>.

Jonathan, thanks for this pointer. I've now had a look at contrib/mutex.
Coming back to my point, using Zookeeper within Cassandra just to be able to
deliver a "unique key generation function" out of Cassandra seems like
overkill; in that case the application could use Zookeeper directly to
accomplish this task.

However, I think we're already more than half-way there without Zookeeper:
the problem is that after the "concurrent" insert, both reads might try to
repair in opposite directions, so the inconsistency is not removed but only
turned upside down. This could be prevented by introducing a tie-breaker.

Imagine the following rule: if we are in doubt whether to repair a column
with timestamp T (because two values X and Y are present within the cluster,
both at timestamp T), then we always repair towards X if md5(X)<md5(Y). In
this case, even after an inconsistency on the first insert, this would be
cleared by any node that triggers a repair afterwards.
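
A minimal Python sketch of such a rule, assuming a cell is just a
(value, timestamp) pair; reconcile() is a hypothetical helper, not an existing
Cassandra function:

```python
import hashlib


def md5_digest(value: bytes) -> bytes:
    """Raw MD5 digest; bytes compare lexicographically, which is all we need."""
    return hashlib.md5(value).digest()


def reconcile(x, y):
    """Pick the surviving cell out of two (value, timestamp) pairs.

    The higher timestamp wins. On a timestamp tie, the value with the
    smaller MD5 digest wins, so the result does not depend on which node
    runs the comparison or on the order of the arguments."""
    (value_x, ts_x), (value_y, ts_y) = x, y
    if ts_x != ts_y:
        return x if ts_x > ts_y else y
    return x if md5_digest(value_x) <= md5_digest(value_y) else y


cell_a = (b"value_A", 1)
cell_b = (b"value_B", 1)

# Every node computes the same winner, whichever side it starts from.
same_winner = reconcile(cell_a, cell_b) == reconcile(cell_b, cell_a)
```

The point of hashing is only to get an ordering that every node can compute
from the values alone; any other total order on the values would work too.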

And then you're done: a Cassandra client can create a unique transaction ID
by inserting a column with the ID this client wants to grab as key, and some
random stuff as value, then the client reads the just-inserted column, and
if the ID and the same random stuff is there - voila, the ID is unique for
this client.
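
A rough Python sketch of that claim protocol. ConvergedColumn is a
hypothetical in-memory stand-in for the column once repair has converged under
the md5 rule, and place_claim() / holds_claim() are made-up names. The sketch
assumes both inserts have landed before either client reads back; it does not
model the timing of real replicas.

```python
import hashlib
import os


class ConvergedColumn:
    """Hypothetical stand-in for one column (keyed by the wanted ID) after
    repair has converged using the md5 tie-breaker."""

    def __init__(self):
        self.cell = None  # (value, timestamp) or None

    def insert(self, value: bytes, timestamp: int) -> None:
        new = (value, timestamp)
        if self.cell is None or self._wins(new, self.cell):
            self.cell = new

    @staticmethod
    def _wins(new, old) -> bool:
        # Higher timestamp wins; on a tie the smaller MD5 digest wins.
        if new[1] != old[1]:
            return new[1] > old[1]
        return hashlib.md5(new[0]).digest() < hashlib.md5(old[0]).digest()

    def get(self):
        return None if self.cell is None else self.cell[0]


def place_claim(column: ConvergedColumn, timestamp: int) -> bytes:
    """Write some random stuff as the value and remember it."""
    token = os.urandom(16)
    column.insert(token, timestamp)
    return token


def holds_claim(column: ConvergedColumn, token: bytes) -> bool:
    """The claim succeeded only if our own random value is what we read back."""
    return column.get() == token


column = ConvergedColumn()
token_a = place_claim(column, timestamp=1)  # client A
token_b = place_claim(column, timestamp=1)  # client B, same timestamp
winners = [holds_claim(column, token_a), holds_claim(column, token_b)]
```

Exactly one entry of winners is True here, but which client wins depends on
the random tokens.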

-Roland



Re: Detailed behavior of insert() operation?

Posted by Jonathan Ellis <jb...@gmail.com>.

2010/4/28 Roland Hänel <ro...@haenel.me>:
> Thanks Jonathan, that hits exactly the heart of my question. Unfortunately
> it kills my original idea to implement a "unique transaction identifier
> creation algorithm" - for this, even eventual consistency would be
> sufficient, but I would need to know if I am consistent at the time of a
> read request.

right, for that kind of thing you'd really want to use contrib/mutex

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Detailed behavior of insert() operation?

Posted by Roland Hänel <ro...@haenel.me>.

Thanks Jonathan, that hits exactly the heart of my question. Unfortunately
it kills my original idea to implement a "unique transaction identifier
creation algorithm" - for this, even eventual consistency would be
sufficient, but I would need to know if I am consistent at the time of a
read request.

One last question (sorry to bother you): isn't the behavior of read repair
strictly deterministic in this case? You say both read requests could try to
read repair the result (each time in the opposite direction). Inside the
read repair algorithm, when we have exactly the same timestamps, what value
is elected for repair? The first one that the node got in the read request?
If we make that deterministic, we could avoid this scenario, right?

-Roland



Re: Detailed behavior of insert() operation?

Posted by Jonathan Ellis <jb...@gmail.com>.

2010/4/28 Roland Hänel <ro...@haenel.me>:
> Two clients insert the same key/column with different values at the same
> time:
>
>    client A does insert(keyspace, key_1, column_name_1, value_A, timestamp_1, consistency_level.QUORUM)
>    client B does insert(keyspace, key_1, column_name_1, value_B, timestamp_1, consistency_level.QUORUM)
>
> After that, both clients read their value:
>
>    client A does get(keyspace, key_1, column_name_1, consistency_level.QUORUM)
>    client B does get(keyspace, key_1, column_name_1, consistency_level.QUORUM)
>
> It is obvious that since the insert happens 'at the same time', i.e. with
> the same timestamp, we cannot say which value (value_A or value_B) gets
> written to the row. However, do we have a guarantee that either value_A or
> value_B is written, and that both read operations will return the same
> result?

The guarantee is that "eventually" you will get a consistent result.

Say both writes overlap such that value A is present on replicas R1
and R2, and value B is present on replica R3 (after both writes
complete).

Simultaneous read operations could then both attempt to "repair" the
other nodes, and again there could be overlap, resulting in still 2
values present, possibly on different nodes this time.

So: you can see different values on reads when there are two
"simultaneous" writes, and this can continue in the worst-case
scenario until one read's repair can finish before another begins.
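
A small Python illustration of the R1/R2/R3 case. It is a toy model, not the
real read path: quorum_read() is a hypothetical helper, and "ties go to the
first response" is an assumption made only to show how two quorum reads can
disagree.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, int]  # (value, timestamp)


def quorum_read(replicas: Dict[str, Cell], contacted: List[str]) -> str:
    """Return the value with the highest timestamp among the contacted
    replicas. Assumption: on a timestamp tie the first response is kept."""
    best = replicas[contacted[0]]
    for name in contacted[1:]:
        cell = replicas[name]
        if cell[1] > best[1]:
            best = cell
    return best[0]


# State after both writes complete: A on R1 and R2, B on R3, same timestamp.
replicas = {
    "R1": ("value_A", 1),
    "R2": ("value_A", 1),
    "R3": ("value_B", 1),
}

# Two quorum reads (2 of 3 replicas) that hear from different replicas first.
read_1 = quorum_read(replicas, ["R1", "R3"])
read_2 = quorum_read(replicas, ["R3", "R2"])
print(read_1, read_2)  # value_A value_B
```

Both reads satisfy QUORUM, yet they return different values because nothing
in the timestamps distinguishes the two cells.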

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com