Posted to user@cassandra.apache.org by Jérôme Verstrynge <jv...@gmail.com> on 2010/10/21 03:38:48 UTC

What happens if there is a collision?

Hi,

I am new to Cassandra. I am reading all the documentation I can find 
online.

My question is the following:

-) Let's imagine a cluster with 5 nodes ABCDE. We know that quorum = 3.
-) Let's imagine a column called MyColumn.
-) Let's imagine current timestamp = 3567890.
-) Let's imagine node A updates MyColumn with value 'AAA' and timestamp 
3567890
-) Let's imagine node E updates MyColumn with value 'EEE' and timestamp 
3567890

What happens? Who wins? Is it deterministic?

Let's imagine node A performs 3 writes before node E, is any node 
notified of the collision?

Thanks,

JVerstry

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

Peter, many thanks for all this information.

On 26/10/2010 21:17, Peter Schuller wrote:
> It does mention that timestamps are used for conflict resolution but
> does not really dwell on the issue, and the remainder elides
> timestamps. So perhaps it's easy to miss. I also notice that the
> phrasing is such that it is not entirely unreasonable to interpret it
> like it seems you have.
Usually, I try to come with a fresh state of mind when learning about a 
new technology, but then again, I am contaminated with 'SQL' too... I 
made up my own interpretation and I did not find info that invalidates it...

I am glad we had this whole conversation, because it reveals blind spots 
and points at what we can do to facilitate 'deterministic 
Cassandra learning' (lol).

Somewhere down the road, I'll try to find some time to write a post on 
my blog to cover these issues. Something along the lines of: 'Introduction 
to timestamps in Cassandra'...

Jérôme

Re: What happens if there is a collision?

Posted by Brandon Williams <dr...@gmail.com>.

On Tue, Oct 26, 2010 at 2:17 PM, Peter Schuller <peter.schuller@infidyne.com> wrote:
>
>  > ii) In case of timestamp ties, value breaks ties.
>
> Same here, if this is indeed intended to be a guarantee and not an
> artifact of the current implementation (anyone want to comment - jbellis?).
>
>
https://issues.apache.org/jira/browse/CASSANDRA-1039

-Brandon

Re: What happens if there is a collision?

Posted by Peter Schuller <pe...@infidyne.com>.

> I may have been unclear about the meaning of timestamp in Cassandra. I was
> under the impression that any given data with the same key value and two
> different timestamps would result in two 'rows'. From what you say, it does
> not seem to be the case. Do you confirm? (In other words, whoever has the
> greatest timestamp destroys the previous records with lower timestamps).

Yes (other than the use of the word "row"). An "insert" of a column (a
column being essentially a key/value pair) causes the key to be
associated with that value. If there was already a column with the
same key, it is replaced. If not, a column is added.
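
As a rough illustration of the replacement rule described above, here is
a minimal Python sketch. The Column tuple and the reconcile() helper are
hypothetical names for this example, not Cassandra's API, and the
tie-break direction (the byte-wise greater value wins) is an assumption;
the actual rule is whatever the implementation defines.

```python
from typing import NamedTuple, Optional


class Column(NamedTuple):
    """One column version: a key/value pair plus the client-supplied timestamp."""
    name: str
    value: bytes
    timestamp: int


def reconcile(existing: Optional[Column], incoming: Column) -> Column:
    """Return the version of a column that survives (hypothetical helper)."""
    if existing is None:
        return incoming  # no column with this key yet: it is simply added
    if incoming.timestamp != existing.timestamp:
        # The higher timestamp replaces the lower one; no second "row" appears.
        return incoming if incoming.timestamp > existing.timestamp else existing
    # Timestamp tie. ASSUMPTION: the byte-wise greater value wins.
    return incoming if incoming.value > existing.value else existing


a = Column("MyColumn", b"AAA", 3567890)
e = Column("MyColumn", b"EEE", 3567890)
newer = Column("MyColumn", b"BBB", 3567891)

# The order in which the two tied writes arrive does not change the outcome.
replica1 = reconcile(reconcile(None, a), e)
replica2 = reconcile(reconcile(None, e), a)
```

Under these assumptions both replicas hold the same single column, and a
later write with a higher timestamp (newer) replaces it regardless of
its value.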

If you have a situation where conflicting writes cannot be allowed,
you'll either have to have some strong co-ordination of writers
outside of Cassandra or else "serialize" the problem by writing
intended changes to some kind of queue/data structure that some
particular guaranteed-to-be-alone Cassandra client processes in batch
mode independently (thereby avoiding the need for co-ordination).

> I know I am boxing a corner case, but I have not seen in the documentation
> that the latest timestamp erases/overwrites previous data. Now, I may have
> missed something here. Maybe I did not rub my eyes enough or the coffee was
> not operating yet.

I'm not sure where it's most clearly stated and I don't remember how I
figured these things out originally. I think the closest thing on the
wiki would be:

  http://wiki.apache.org/cassandra/DataModel

It does mention that timestamps are used for conflict resolution but
does not really dwell on the issue, and the remainder elides
timestamps. So perhaps it's easy to miss. I also notice that the
phrasing is such that it is not entirely unreasonable to interpret it
like it seems you have.

At the same time that page is somewhat of a mix between internal
models and the model exposed to clients, so I'm not sure how best to
improve the phrasing.

Riptano's recently added documentation may be worth reading:

   http://www.riptano.com/docs/0.6.5/index

Though upon cursory examination I'm not sure whether it is more clear
on this particular point.

> i) That the most recent timestamp overwrites previous entries with a lower
> timestamp.

This can definitely be clarified.

> ii) In case of timestamp ties, value breaks ties.

Same here, if this is indeed intended to be a guarantee and not an
artifact of the current implementation (anyone want to comment - jbellis?).

> iii) What about ColumnFamilies and SuperColumnFamilies? Do we have the
> guarantee that, in case of timestamp ties, the whole record of the winner
> is registered? (I would assume yes, of course.)

Individual columns may be inserted into a SuperColumn so it is not
inserted as one compound value. If writers A and B both do concurrent
insertions to a SuperColumn where A writes column C1 and B writes
column C1 and C2, B's write of C2 will always stick, but C1 will be
subject to individual column conflict resolution. Keep in mind however
that typically timestamps are not allocated/chosen on a per-column
basis by a client. It does occur to me that at this point you may
actually have issues with the timestamp tie and value based conflict
resolution if you are expecting a set of column updates to either
apply or not apply as a group (with respect to some other group of
updates). That's a bit subtle.
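
To make the A/B example concrete, here is a small Python sketch of
per-column resolution inside one SuperColumn. The merge_supercolumn()
helper and the (timestamp, value) tuples are hypothetical, and the
tie-break (greater value wins) is again an assumption for illustration
only.

```python
from typing import Dict, Tuple

Versioned = Tuple[int, bytes]  # (timestamp, value) for one column


def merge_supercolumn(
    current: Dict[str, Versioned], write: Dict[str, Versioned]
) -> Dict[str, Versioned]:
    """Apply a write column by column (hypothetical helper)."""
    merged = dict(current)
    for name, candidate in write.items():
        existing = merged.get(name)
        # Tuple comparison: higher timestamp first, then (ASSUMED) greater value.
        if existing is None or candidate > existing:
            merged[name] = candidate
    return merged


# Both writers happen to pick the same timestamp for their whole update.
write_a = {"C1": (100, b"zebra")}
write_b = {"C1": (100, b"apple"), "C2": (100, b"from-B")}

state = merge_supercolumn(merge_supercolumn({}, write_a), write_b)
```

Here state ends up with C1 from writer A and C2 from writer B, so B's
update is only partially visible. That is the "subtle" case: a set of
column updates is not resolved as a group.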

Also on the topic of granularity, entire super columns and entire rows
may be deleted without individually referring to all columns. In those
cases, deletes span entire rows or supercolumns rather than individual
columns.

-- 
/ Peter Schuller

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

Peter, thanks for extensive feedback. Much appreciated.

On 26/10/2010 0:47, Peter Schuller wrote:
> This doesn't mean that your problem is somehow invalid; but it doesn't
> sound like (over-writing) writes at QUORUM consistency are the solution.

> What is the difference, from your application's perspective, between
> the timestamp tie and a write simply happening a millisecond later by
> an un-coordinated concurrent writer? In both cases, the data in
> cassandra will no longer match your client's view of it.
I may have been unclear about the meaning of timestamp in Cassandra. I 
was under the impression that any given data with the same key value and 
two different timestamps would result in two 'rows'. From what you say, 
it does not seem to be the case. Do you confirm? (In other words, 
whoever has the greatest timestamp destroys the previous records with 
lower timestamps).

> I'm repeating myself but just to be clear: So again, it seems to me
> such an ACK would not be useful since you would not be made aware of
> any change that happens later on anyway. It does not seem semantically
> "relevant" except perhaps as a probabilistic optimization. As soon as
> your write completes, you have no idea what is in Cassandra,
> regardless of timestamp ties (assuming you have the potential for
> concurrent writers).
Assuming the latest timestamp erases/overwrites previous entries, I agree.

>> If 'value breaks timestamp-tie', how does Cassandra behave in case of
>> updates? If there is a column with value 'AAA' at 334450 ms and an
>> application explicitly wants to update this value to 'ZZZ' for 334450 ms,
>> it seems like the timestamp-tie will prevent that. Hence, the
>> update/mutation would be non-deterministic to E. It seems like one should
>> first delete the existing record and write a new one (and that could lead to
>> race conditions and timestamp-ties too).
> A single client wishing to make multiple logically subsequent writes
> should ensure that the same timestamp is not used for such writes.
Makes sense if the latest timestamp erases/overwrites previous data.

>> I think this should be documented, because engineers will hit that 'local'
>> non-deterministic issue for sure if two instances of their applications
>> perform 'completed writes' in the same column family. Completed does not
>> mean successful, even with quorum (or ALL). They ought to know it.
> I think it does. I believe the results you are describing as
> unexpected are fully expected fundamentally, and there is no real
> difference implied in receiving a timestamp ACK flag back. I'm totally
> open to being wrong or having misunderstood something (or both), but
> right now I don't see it. If on the other hand I'm not wrong then
> perhaps we can figure out how to document or present the functionality
> of Cassandra better :)
I know I am boxing a corner case, but I have not seen in the 
documentation that the latest timestamp erases/overwrites previous data. 
Now, I may have missed something here. Maybe I did not rub my eyes 
enough or the coffee was not operating yet.

If not, I would suggest adding some small documentation on the wiki 
explaining:

i) That the most recent timestamp overwrites previous entries with a lower 
timestamp.
ii) In case of timestamp ties, value breaks ties.
iii) What about ColumnFamilies and SuperColumnFamilies? Do we have the 
guarantee that, in case of timestamp ties, the whole record of the 
winner is registered? (I would assume yes, of course.)

I believe something 'official' and explicit from Cassandra leaders would 
close the gap on assumptions and interpretations made by newbies like me. 
Timestamp really looks like a 'key' to me.

Thanks,

Jérôme

Re: What happens if there is a collision?

Posted by Peter Schuller <pe...@infidyne.com>.

(sorry about the delay in responding - inbox backlog)

> REM: I am not trying to make this discussion longer than necessary or to
> play semantics. I am not into that at all and I appreciate the time you
> take to answer me, really.

No problem; and same here. I just think that a mutual understanding
tends to be beneficial both ways ;)

> Here is where I disagree with your conclusion when there is a timestamp tie.
> The write by node E will not be performed successfully (at quorum level),
> because of the tie resolution in favor of A somewhere in all the nodes
> between A and E.
>
> Let's imagine that A initiates its column write at: 334450 ms with 'AAA' and
> timestamp 334450 ms
> Let's imagine that E initiates its column write at: 334451 ms with 'ZZZ' and
> timestamp 334450 ms
> (E is the latest write)
>
> Let's imagine that A reaches C at 334455 ms and performs its write.
> Let's imagine that E reaches C at 334456 ms and attempts to perform its
> write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').
>
> Even if there is no further writing on that same column using timestamp
> 334450, a quorum read won't see that 'ZZZ' value (which is the latest
> attempt to write/update the column).
>
> Node A will have completed a write at QUORUM level.
> Node E will have completed a write at QUORUM level, but its value won't be
> registered and it won't be notified about it.
>
> Hence, I disagree with your conclusion that a quorum write implies that it
> was successfully written. It is not the case for E. I know we could play
> semantics about the meaning of 'successful write' here, but that would
> lead us nowhere and that is not my point.

It goes to the definition of 'written'. One possible definition of
'written' may be that 'if a value is written, it will be seen by a
subsequent read assuming it was not already re-written'. One example
here unrelated to cassandra is a write() in POSIX; if you can prove a
write() happened (and completed) prior to a read() on the same file
say, you are supposed to be guaranteed that the read() will see your
write(). But this does not mean that one cannot submit additional
writes that will over-write the data.

In the case of Cassandra and quorum writes, a similar situation occurs.

Having written a column at QUORUM, you are guaranteed to be able to
read that value back (at QUORUM) at a later time provided that it was
not deleted or over-written in the mean time. None of the sequence
above seems to violate that.
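
The read-back guarantee rests on simple arithmetic: any two quorums of
the same replica set share at least one replica. A small Python sketch,
assuming the 5 replicas A..E from the original question (the helper
names are hypothetical):

```python
from itertools import combinations


def quorum(replicas: int) -> int:
    """Smallest majority of a replica set."""
    return replicas // 2 + 1


def min_overlap(replicas: int, writes: int, reads: int) -> int:
    """Fewest replicas that any write set and any read set must share."""
    return max(0, writes + reads - replicas)


nodes = "ABCDE"
q = quorum(len(nodes))  # 3 for five replicas

# Brute-force check: every possible write quorum meets every read quorum.
always_overlap = all(
    set(w) & set(r)
    for w in combinations(nodes, q)
    for r in combinations(nodes, q)
)
```

With 5 replicas, q is 3 and min_overlap(5, 3, 3) is 1, so a QUORUM read
always consults at least one replica that saw the QUORUM write. Which
version that replica then reports is still decided by timestamp
resolution.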

You seem to be after the read seeing your write of 'ZZZ'. But under
what definition of 'written' do you expect this to happen in the face
of concurrent writers? There is never a guarantee that the entire
history of data ever written will be readable in the future; an
overwrite is still an overwrite. Even with something like a local disk
and fsync() in between each write, you have this problem in the
absence of synchronization of readers and writers.

This doesn't mean that your problem is somehow invalid; but it doesn't
sound like (over-writing) writes at QUORUM consistency are the solution.

> Here is what I am trying to do and why:
>
> If there is no timestamp-tie between A and E, then I have no issue.
>
> If there is a timestamp-tie, then the context becomes uncertain for E, out
> of the blue.
> If application E can't be sure about what has been saved in Cassandra, it
> cannot rely on what it has in memory. It is a vicious circle. It can't
> anticipate on the potential actions of A on the column too.
> This is unusual for any application, but maybe this is the price to pay for
> using Cassandra. Fair enough.

The problem here is - how would your application *ever* know without
synchronization? The situation should be the same even without a
timestamp tie. In either case, you're writing something to Cassandra
and you know there may be concurrent writers. When the write call
completes, you will never know whether the data you wrote is the
"current" value of the data.

That is, unless you *do* have some form of synchronization which
allows you to guarantee (and know that it is guaranteed) that there is
no timestamp tie, and that your application is informed of other
writes with newer timestamps. But if you have this, it sounds like you
already have a synchronization mechanism?

(Now; Cassandra could support some kind of pub/sub to allow you to be
notified of changes relative to your written data. It doesn't, at the
moment. But I don't think the current behavior is incorrect with
respect to QUORUM consistency.)

> If E is not informed of the timestamp tie, then it is left alone in the
> dark. Hence, this is why I say Cassandra is not deterministic to E. The
> result of a write is potentially non-deterministic in what it actually
> performs.

To rephrase myself a bit: I claim that the result of the write is
non-deterministic in the above sense *anyway*, unless you have a
strictly synchronized concept of monotonically increasing time and the
ability to ascertain the relative order of a write with respect to
other writes in the cluster.

If you *do* have this, then yes, given identical timestamps you have a
problem you would not have otherwise (say with infinite resolution
time). But if you have this level of synchronization, can you perhaps
guarantee that no two writers ever choose the same timestamp instead?

> If E was aware that it lost a timestamp-tie, it would know that there is a
> possible gap between its internal memory representation and what it tried to
> save into Cassandra. That is, EVEN if there is no further write on that same
> column (or, in other words, regardless of any potential subsequent races).
>
> If E was informed it lost a timestamp-tie, it could re-read the column (and
> let's assume that there is no further write in between, but this does not
> change anything to the argument). It could spot that its write for timestamp
> value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ').
> It could operate a new write, which eventually could result in another
> timestamp-tie, but at least it would be informed about it too... It would
> have a safety net.

What is the difference, from your application's perspective, between
the timestamp tie and a write simply happening a millisecond later by
an un-coordinated concurrent writer? In both cases, the data in
cassandra will no longer match your client's view of it.

> The case I am trying to cover is the case where the context for application
> E becomes invalid because of a successful write call to Cassandra without
> registration of 'ZZZ'. How can Cassandra call it a successful write, when in
> fact, it isn't for application E? I believe Cassandra should notify
> application E one way or another. This is why I mentioned an extra
> timestamp-tie flag in the write ACK sent by nodes back to node E.

I'm repeating myself but just to be clear: So again, it seems to me
such an ACK would not be useful since you would not be made aware of
any change that happens later on anyway. It does not seem semantically
"relevant" except perhaps as a probabilistic optimization. As soon as
your write completes, you have no idea what is in Cassandra,
regardless of timestamp ties (assuming you have the potential for
concurrent writers).

> If 'value breaks timestamp-tie', how does Cassandra behave in case of
> updates? If there is a column with value 'AAA' at 334450 ms and an
> application explicitly wants to update this value to 'ZZZ' for 334450 ms,
> it seems like the timestamp-tie will prevent that. Hence, the
> update/mutation would be non-deterministic to E. It seems like one should
> first delete the existing record and write a new one (and that could lead to
> race conditions and timestamp-ties too).

A single client wishing to make multiple logically subsequent writes
should ensure that the same timestamp is not used for such writes.
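
One way a single client could do this is to hand out strictly
increasing timestamps itself. The following Python sketch is only an
illustration: the MonotonicTimestamps class is hypothetical, and the
microsecond resolution is an assumption, not something Cassandra
requires.

```python
import threading
import time


class MonotonicTimestamps:
    """Hands out strictly increasing timestamps within one client process."""

    def __init__(self, clock=None):
        # ASSUMPTION: microseconds since the epoch; any integer clock works.
        self._clock = clock or (lambda: int(time.time() * 1_000_000))
        self._last = 0
        self._lock = threading.Lock()

    def next(self) -> int:
        with self._lock:
            now = self._clock()
            # If the clock has not advanced (or went backwards), bump by one.
            self._last = max(now, self._last + 1)
            return self._last


# With a frozen clock, successive writes still get distinct timestamps.
frozen = MonotonicTimestamps(clock=lambda: 334450)
stamps = [frozen.next() for _ in range(3)]  # [334450, 334451, 334452]
```

This only orders the writes of one client. It does nothing to prevent
two independent clients from choosing the same timestamp.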

> I think this should be documented, because engineers will hit that 'local'
> non-deterministic issue for sure if two instances of their applications
> perform 'completed writes' in the same column family. Completed does not
> mean successful, even with quorum (or ALL). They ought to know it.

I think it does. I believe the results you are describing as
unexpected are fully expected fundamentally, and there is no real
difference implied in receiving a timestamp ACK flag back. I'm totally
open to being wrong or having misunderstood something (or both), but
right now I don't see it. If on the other hand I'm not wrong then
perhaps we can figure out how to document or present the functionality
of Cassandra better :)

-- 
/ Peter Schuller

Re: What happens if there is a collision?

Posted by Chris Dean <ct...@sokitomi.com>.

Peter Schuller <pe...@infidyne.com> writes:
>> The timestamp is an ever increasing clock so I wouldn't expect two api
>> calls from the same machine in the same thread to have the same
>> timestamp.  It is perfectly allowed behavior for the read value to not
>> agree with the write value.
>
> In the *particular* case of a single instantiation of a client I would
> tend to expect it to actually guarantee strictly increasing time just
> as a matter of thread-local consistency so that a single flow of
> control can assume that writes will happen in the order in which they
> are executed. (Is this actually the case for current high-level
> clients?)
>
> But of course, there is no such guarantee in the distributed sense
> either way.

The point is in reply to this message:

Jérôme Verstrynge <jv...@gmail.com> writes:
> You are making my point (lol). No matter what an application writes,
> it should re-read its own write for determinism for a given timestamp
> when other application instances are writing in the same 'table'.

There is no such situation in Cassandra.  An application may read things
differently than it writes.  You may not hold the timestamp constant and
use that as a sort of locking mechanism.  The timestamp is an
ever-increasing clock.

Cheers,
Chris Dean

Re: What happens if there is a collision?

Posted by Peter Schuller <pe...@infidyne.com>.

> The timestamp is an ever increasing clock so I wouldn't expect two api
> calls from the same machine in the same thread to have the same
> timestamp.  It is perfectly allowed behavior for the read value to not
> agree with the write value.

In the *particular* case of a single instantiation of a client I would
tend to expect it to actually guarantee strictly increasing time just
as a matter of thread-local consistency so that a single flow of
control can assume that writes will happen in the order in which they
are executed. (Is this actually the case for current high-level
clients?)

But of course, there is no such guarantee in the distributed sense either way.

-- 
/ Peter Schuller

Re: What happens if there is a collision?

Posted by Chris Dean <ct...@sokitomi.com>.

Jérôme Verstrynge <jv...@gmail.com> writes:
> You are making my point (lol). No matter what an application writes,
> it should re-read its own write for determinism for a given timestamp
> when other application instances are writing in the same 'table'.

The timestamp is an ever increasing clock so I wouldn't expect two api
calls from the same machine in the same thread to have the same
timestamp.  It is perfectly allowed behavior for the read value to not
agree with the write value.

Cheers,
Chris Dean

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

On 22/10/2010 2:27, Nicholas Knight wrote:
> On Oct 22, 2010, at 7:41 AM, Jérôme Verstrynge wrote:
>> Let's imagine that A initiates its column write at: 334450 ms with 'AAA' and timestamp 334450 ms
>> Let's imagine that E initiates its column write at: 334451 ms with 'ZZZ' and timestamp 334450 ms
>> (E is the latest write)
>>
>> Let's imagine that A reaches C at 334455 ms and performs its write.
>> Let's imagine that E reaches C at 334456 ms and attempts to perform its write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').
> How is this any different from E's perspective than if A had come along a moment later with timestamp 334452?
If this results in only one entry, then I am happy. If this results in 
two entries (334450 and 334452), then the situation is different and 
does not correspond to my argument.

When I read http://wiki.apache.org/cassandra/DataModel, the column 
section explicitly says: "All values are supplied by the client, 
including the 'timestamp'."

Hence, nothing in this documentation explicitly guarantees that only 
one record is created.

> What you describe is an application in *desperate* need of either a serious redesign, or a distributed locking mechanism.
>
> This really isn't a Cassandra-specific problem, Cassandra just happens to be the distributed storage system at issue. Any such system without a locking mechanism will present some form of this problem, and the answer will be the same: Avoid it in the application design, or incorporate a locking mechanism into the application.
I agree about the problem not being specific to Cassandra. I have 
nothing against Cassandra. In fact, I am fascinated by it and am 
considering using it in my own projects.

>> If there is a timestamp-tie, then the context becomes uncertain for E, out of the blue.
>> If application E can't be sure about what has been saved in Cassandra, it cannot rely on what it has in memory. It is a vicious circle. It can't anticipate on the potential actions of A on the column too.
> And how is this different from E's data being overwritten with a later timestamp? Either way, what E thinks is in Cassandra really isn't.
Well, E knows that it can't predict the value for future timestamp 
values coming from other nodes. Fine. What I am worried about is that it 
can't predict the value for its own timestamp.

> If you need to make sure you have consistency at this level, you *need* a locking mechanism.
>> This is unusual for any application, but maybe this is the price to pay for using Cassandra. Fair enough.
> Hardly. Any non-serial application that doesn't use some form of locking has this exact same problem at all levels of storage, possibly even in its internal variables.
I have not argued against locking as a potential solution. I am only 
suggesting something lighter.

>> If E is not informed of the timestamp tie, then it is left alone in the dark. Hence, this is why I say Cassandra is not deterministic to E. The result of a write is potentially non-deterministic in what it actually performs.
> Cassandra is deterministic for a given input. What you're saying is you aren't properly controlling the input that your application is giving it.
You are making my point (lol). No matter what an application writes, it 
should re-read its own write for determinism for a given timestamp when 
other application instances are writing in the same 'table'.

>> If E was aware that it lost a timestamp-tie, it would know that there is a possible gap between its internal memory representation and what it tried to save into Cassandra. That is, EVEN if there is no further write on that same column (or, in other words, regardless of any potential subsequent races).
> What is the significance of this?
If you know there is no timestamp collision, then you know you don't 
need to re-read for determinism. Otherwise you should. In a situation 
where you can't know, you should automatically re-read, which is 
expensive (or implement a locking mechanism).

>> If E was informed it lost a timestamp-tie, it could re-read the column (and let's assume that there is no further write in between, but this does not change anything to the argument). It could spot that its write for timestamp value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ'). It could operate a new write, which eventually could result in another timestamp-tie, but at least it would be informed about it too... It would have a safety net.
> To what end? A and E would apparently get into some sort of never-ending fight. The application as described is broken and needs to be fixed.
No, no fight since E would know it can't win because it has the lower 
hand 'ZZZ' for the given timestamp.

>> The case I am trying to cover is the case where the context for application E becomes invalid because of a successful write call to Cassandra without registration of 'ZZZ'. How can Cassandra call it a successful write, when in fact, it isn't for application E? I believe Cassandra should notify application E one way or another. This is why I mentioned an extra timestamp-tie flag in the write ACK sent by nodes back to node E.
> Here's part of the problem. You're seeing E as a distinct application from A which can behave completely independently. You need to stop thinking like that. It leads to broken architectures.
>
> Even if the E and A processes come from entirely different code bases, you need to start by thinking of them as one application. That application is broken.
I am not going to argue this, because it is not related to my argument. 
I mean no offense by saying this.

>> The subsequent question I have is:
>>
>> If 'value breaks timestamp-tie', how does Cassandra behave in case of updates? If there is a column with value 'AAA' at 334450 ms and an application explicitly wants to update this value to 'ZZZ' for 334450 ms, it seems like the timestamp-tie will prevent that. Hence, the update/mutation would be non-deterministic to E. It seems like one should first delete the existing record and write a new one (and that could lead to race conditions and timestamp-ties too).
> You need a locking mechanism. Timestamps aren't the droids you're looking for.
In this case, I do agree that explicit updates on a given timestamp 
can't be achieved without locks.

>> I think this should be documented, because engineers will hit that 'local' non-deterministic issue for sure if two instances of their applications perform 'completed writes' in the same column family. Completed does not mean successful, even with quorum (or ALL). They ought to know it.
> I'm honestly not sure why they wouldn't. One need only perform a very cursory investigation of Cassandra to realize that addition of a locking mechanism is necessary for many applications, such as the one described here.
Again, I am not saying locks are not a solution. I was just suggesting a 
lighter solution for the issue I was raising. Implementing locks in a 
Cassandra-like system is tricky. The proposed solutions so far are 
costly and heavy.
> -NK
Thanks for your answer.

Jérôme

Re: What happens if there is a collision?

Posted by Nicholas Knight <nk...@runawaynet.com>.

On Oct 22, 2010, at 7:41 AM, Jérôme Verstrynge wrote:

> Let's imagine that A initiates its column write at: 334450 ms with 'AAA' and timestamp 334450 ms
> Let's imagine that E initiates its column write at: 334451 ms with 'ZZZ' and timestamp 334450 ms
> (E is the latest write)
> 
> Let's imagine that A reaches C at 334455 ms and performs its write.
> Let's imagine that E reaches C at 334456 ms and attempts to perform its write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').

How is this any different from E's perspective than if A had come along a moment later with timestamp 334452?

What you describe is an application in *desperate* need of either a serious redesign, or a distributed locking mechanism.

This really isn't a Cassandra-specific problem, Cassandra just happens to be the distributed storage system at issue. Any such system without a locking mechanism will present some form of this problem, and the answer will be the same: Avoid it in the application design, or incorporate a locking mechanism into the application.

> If there is a timestamp-tie, then the context becomes uncertain for E, out of the blue.
> If application E can't be sure about what has been saved in Cassandra, it cannot rely on what it has in memory. It is a vicious circle. It can't anticipate on the potential actions of A on the column too.

And how is this different from E's data being overwritten with a later timestamp? Either way, what E thinks is in Cassandra really isn't.

If you need to make sure you have consistency at this level, you *need* a locking mechanism.

> This is unusual for any application, but maybe this is the price to pay for using Cassandra. Fair enough.

Hardly. Any non-serial application that doesn't use some form of locking has this exact same problem at all levels of storage, possibly even in its internal variables.

> 
> If E is not informed of the timestamp tie, then it is left alone in the dark. Hence, this is why I say Cassandra is not deterministic to E. The result of a write is potentially non-deterministic in what it actually performs.

Cassandra is deterministic for a given input. What you're saying is you aren't properly controlling the input that your application is giving it.

> If E was aware that it lost a timestamp-tie, it would know that there is a possible gap between its internal memory representation and what it tried to save into Cassandra. That is, EVEN if there is no further write on that same column (or, in other words, regardless of any potential subsequent races).

What is the significance of this?

> 
> If E was informed it lost a timestamp-tie, it could re-read the column (and let's assume that there is no further write in between, but this does not change anything to the argument). It could spot that its write for timestamp value 334450 ms failed, and also the reason why ('AAA' greater than 'ZZZ'). It could operate a new write, which eventually could result in another timestamp-tie, but at least it would be informed about it too... It would have a safety net.

To what end? A and E would apparently get into some sort of never-ending fight. The application as described is broken and needs to be fixed.

> 
> The case I am trying to cover is the case where the context for application E becomes invalid because of a successful write call to Cassandra without registration of 'ZZZ'. How can Cassandra call it a successful write, when in fact, it isn't for application E? I believe Cassandra should notify application E one way or another. This is why I mentioned an extra timestamp-tie flag in the write ACK sent by nodes back to node E.

Here's part of the problem. You're seeing E as a distinct application from A which can behave completely independently. You need to stop thinking like that. It leads to broken architectures.

Even if the E and A processes come from entirely different code bases, you need to start by thinking of them as one application. That application is broken.

> 
> The subsequent question I have is:
> 
> If 'value breaks timestamp-tie', how does Cassandra behave in case of updates? If there is a column with value 'AAA' at 334450 ms and an application explicitly wants to update this value to 'ZZZ' for 334450 ms, it seems like the timestamp-tie will prevent that. Hence, the update/mutation would be non-deterministic to E. It seems like one should first delete the existing record and write a new one (and that could lead to race conditions and timestamp-ties too).

You need a locking mechanism. Timestamps aren't the droids you're looking for.
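
A rough Python sketch of what that could look like, for illustration only. Everything here is a hypothetical stand-in: `column_lock` is an in-process `threading.Lock` (real writers on different machines would need a lock they all share, such as an external lock service), and `stored` is a plain dict rather than Cassandra. The idea is that once writers are serialized, each one can pick a timestamp strictly newer than the stored one, so ties never occur.

```python
import threading
import time

# Hypothetical in-process stand-ins: a real deployment would need a lock
# shared by all writers (an external lock service), not a threading.Lock.
column_lock = threading.Lock()
stored = {"value": None, "timestamp": 0}
used = []  # every timestamp handed out, recorded for the check below


def locked_write(value: str) -> int:
    """Write under the lock with a timestamp strictly newer than the stored one."""
    with column_lock:
        now_ms = int(time.time() * 1000)
        timestamp = max(now_ms, stored["timestamp"] + 1)
        stored["value"] = value
        stored["timestamp"] = timestamp
        used.append(timestamp)
        return timestamp


def writer(value: str) -> None:
    for _ in range(100):
        locked_write(value)


threads = [threading.Thread(target=writer, args=(v,)) for v in ("AAA", "ZZZ")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# No two writes ever shared a timestamp, so no tie-break was needed.
print(len(used) == len(set(used)))
```

The price is that every writer to that column has to go through the same lock, which is the heavier option discussed earlier in the thread.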

> I think this should be documented, because engineers will hit that 'local' non-deterministic issue for sure if two instances of their applications perform 'completed writes' in the same column family. Completed does not mean successful, even with quorum (or ALL). They ought to know it.

I'm honestly not sure why they wouldn't. One need only perform a very cursory investigation of Cassandra to realize that addition of a locking mechanism is necessary for many applications, such as the one described here.

-NK

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

On 21/10/2010 23:40, Peter Schuller wrote:
>> OK. Thanks for your answer. From an email exchange I had with Jonathan, all
>> this means that one should re-read its writes with quorum to make sure they
>> have not been overridden by timestamp-tie conflicts. I suggested sending
>> feedback to the writing node (in the ACK) when such timestamp-tie conflicts
>> happen. This would avoid having to double-check all writes for timestamp-tie
>> conflicts.
>>
>> If multiple applications write to the same ColumnFamily/Tables, this
>> double-check is a must (unless a separate locking mechanism is implemented,
>> which would be heavier).
> I'm not sure I understand what you're trying to accomplish. Given that
> you have no locking/synchronization mechanism external to Cassandra,
> what is it that you are actually learning from re-reading the value? A
> completed write at level QUORUM means it was successfully written and
> that readers reading at QUORUM will see it unless the value has been
> updated subsequently.
REM: I am not trying to make this discussion longer than necessary or to
play semantics. I am not into that at all and I appreciate the time you
take to answer me, really.

Here is where I disagree with your conclusion when there is a timestamp 
tie. The write by node E will not be performed successfully (at quorum 
level), because of the tie resolution in favor of A somewhere in all the 
nodes between A and E.

Let's imagine that A initiates its column write at: 334450 ms with 'AAA' 
and timestamp 334450 ms
Let's imagine that E initiates its column write at: 334451 ms with 
'ZZZ' and timestamp 334450 ms
(E is the latest write)

Let's imagine that A reaches C at 334455 ms and performs its write.
Let's imagine that E reaches C at 334456 ms and attempts to perform its
write. It will lose the timestamp-tie ('AAA' is greater than 'ZZZ').

Even if there is no further write on that same column using timestamp
334450, a quorum read won't see that 'ZZZ' value (which is the latest 
attempt to write/update the column).

Node A will have completed a write at QUORUM level.
Node E will have completed a write at QUORUM level, but its value won't
be registered and it won't be notified about it.

Hence, I disagree with your conclusion that a quorum write implies that 
it was successfully written. It is not the case for E. I know we could 
play semantics about the meaning of 'successful write' here, but that
would not lead us anywhere and that is not my point.

> But even if you re-read, that does not remove
> the fundamental potential for a race condition (i.e., you still don't
> know when you see the result of your read whether it wasn't just
> overwritten anyway just after you did your read).
>
> Perhaps I'm misunderstanding what you're trying to do?
I totally agree there is a risk of race condition.

Here is what I am trying to do and why:

If there is no timestamp-tie between A and E, then I have no issue.

If there is a timestamp-tie, then the context becomes uncertain for E, 
out of the blue.
If application E can't be sure about what has been saved in Cassandra, 
it cannot rely on what it has in memory. It is a vicious circle. It
can't anticipate the potential actions of A on the column either.
This is unusual for any application, but maybe this is the price to pay
for using Cassandra. Fair enough.

If E is not informed of the timestamp tie, then it is left alone in the 
dark. Hence, this is why I say Cassandra is not deterministic to E. The 
result of a write is potentially non-deterministic in what it actually 
performs.

If E was aware that it lost a timestamp-tie, it would know that there is 
a possible gap between its internal memory representation and what it 
tried to save into Cassandra. That is, EVEN if there is no further write 
on that same column (or, in other words, regardless of any potential 
subsequent races).

If E was informed it lost a timestamp-tie, it could re-read the column 
(and let's assume that there is no further write in between, but this 
does not change the argument). It could spot that its write
for timestamp value 334450 ms failed, and also the reason why ('AAA'
greater than 'ZZZ'). It could issue a new write, which could eventually
result in another timestamp-tie, but at least it would be informed about
it too... It would have a safety net.

The case I am trying to cover is the case where the context for 
application E becomes invalid because of a successful write call to 
Cassandra without registration of 'ZZZ'. How can Cassandra call it a 
successful write, when in fact, it isn't for application E? I believe 
Cassandra should notify application E one way or another. This is why I 
mentioned an extra timestamp-tie flag in the write ACK sent by nodes 
back to node E.

The subsequent question I have is:

If 'value breaks timestamp-tie', how does Cassandra behave in case of 
updates? If there is a column with value 'AAA' at 334450 ms and an 
application explicitly wants to update this value to 'ZZZ' for 334450
ms, it seems like the timestamp-tie will prevent that. Hence, the
update/mutation would be non-deterministic to E. It seems like one should
first delete the existing record and write a new one (and that could 
lead to race conditions and timestamp-ties too).

My conclusion so far is that a timestamp-tie boolean would help
resolve potentially non-deterministic situations which can appear
randomly at any time. Implementing locks would completely prevent these
situations, but then, locks should be implemented for all writes on all
tables if two application instances have access to them. It is a
light/inexpensive versus heavy/costly safety net situation.

I think this should be documented, because engineers will hit that 
'local' non-deterministic issue for sure if two instances of their
applications perform 'completed writes' in the same column family. 
Completed does not mean successful, even with quorum (or ALL). They 
ought to know it.

Jérôme

Re: What happens if there is a collision?

Posted by Peter Schuller <pe...@infidyne.com>.

> OK. Thanks for your answer. From an email exchange I had with Jonathan, all
> this means that one should re-read its writes with quorum to make sure they
> have not been overridden by timestamp-tie conflicts. I suggested sending
> feedback to the writing node (in the ACK) when such timestamp-tie conflicts
> happen. This would avoid having to double-check all writes for timestamp-tie
> conflicts.
>
> If multiple applications write to the same ColumnFamily/Tables, this
> double-check is a must (unless a separate locking mechanism is implemented,
> which would be heavier).

I'm not sure I understand what you're trying to accomplish. Given that
you have no locking/synchronization mechanism external to Cassandra,
what is it that you are actually learning from re-reading the value? A
completed write at level QUORUM means it was successfully written and
that readers reading at QUORUM will see it unless the value has been
updated subsequently. But even if you re-read, that does not remove
the fundamental potential for a race condition (i.e., you still don't
know when you see the result of your read whether it wasn't just
overwritten anyway just after you did your read).
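
To make that concrete, here is a tiny Python sketch of the interleaving. The `store` dict and the `write`/`read` helpers are hypothetical stand-ins for a single column, not Cassandra calls, and the "newer (timestamp, value) wins" rule is an assumption of the sketch.

```python
# Hypothetical single-column store: a dict holding the latest (timestamp, value).
store = {}


def write(value: str, timestamp: int) -> None:
    """Keep the incoming version only if it beats the stored one."""
    current = store.get("MyColumn")
    if current is None or (timestamp, value) > current:
        store["MyColumn"] = (timestamp, value)


def read() -> str:
    return store["MyColumn"][1]


write("ZZZ", 334450)   # E writes
seen_by_e = read()     # E re-reads and sees its own value
write("AAA", 334451)   # A overwrites immediately afterwards

# E's check succeeded, yet what E believes is already out of date.
print(seen_by_e, read())
```

The re-read tells E what the column held at one instant and nothing about the instant after.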

Perhaps I'm misunderstanding what you're trying to do?

-- 
/ Peter Schuller

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

On 21/10/2010 20:03, Peter Schuller wrote:
>> My question is: is node E notified that it lost the battle against A? If yes
>> how?
>>
>> If not, then it means that, although writes are atomic, they would not be
>> deterministic. Node E would have to verify that its write was successful...
> Quorum is not really a special case in that sense; it just tells how
> many nodes must ack the operation. The conflict resolution of columns
> proceeds as usual and data replicates as usual. Node E would do the same
> conflict resolution as other nodes whenever it sees the write,
> regardless of whether that was because it was in the original chosen quorum
> set, or it was given the write a bit later in the background, or it
> happened during anti-entropy etc.

OK. Thanks for your answer. From an email exchange I had with Jonathan,
all this means that one should re-read its writes with quorum to make
sure they have not been overridden by timestamp-tie conflicts. I
suggested sending feedback to the writing node (in the ACK) when such
timestamp-tie conflicts happen. This would avoid having to double-check
all writes for timestamp-tie conflicts.

If multiple applications write to the same ColumnFamily/Tables, this
double-check is a must (unless a separate locking mechanism is
implemented, which would be heavier).

Jérôme

Re: What happens if there is a collision?

Posted by Peter Schuller <pe...@infidyne.com>.

> My question is: is node E notified that it lost the battle against A? If yes
> how?
>
> If not, then it means that, although writes are atomic, they would not be
> deterministic. Node E would have to verify that its write was successful...

Quorum is not really a special case in that sense; it just tells how
many nodes must ack the operation. The conflict resolution of columns
proceeds as usual and data replicates as usual. Node E would do the same
conflict resolution as other nodes whenever it sees the write,
regardless of whether that was because it was in the original chosen quorum
set, or it was given the write a bit later in the background, or it
happened during anti-entropy etc.
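
As a small Python illustration of both points (a sketch, not Cassandra code): `quorum` computes the majority count, and `apply_write` is a hypothetical per-replica resolution step. The ordering used (timestamp first, then value, larger wins) is an assumption of the sketch; what matters is that every replica applies the same fixed rule, so the order in which writes reach it does not change where it ends up.

```python
import itertools


def quorum(replicas: int) -> int:
    """Number of replicas that must acknowledge: a strict majority."""
    return replicas // 2 + 1


def apply_write(stored, incoming):
    """Keep the winner between the stored and the incoming (timestamp, value).

    Assumption: tuples compare by timestamp first, then by value, so the
    larger value wins a timestamp tie.
    """
    if stored is None:
        return incoming
    return max(stored, incoming)


writes = [(334450, "AAA"), (334450, "ZZZ"), (334449, "old")]

# Deliver the same writes to a replica in every possible order.
final_states = set()
for order in itertools.permutations(writes):
    state = None
    for w in order:
        state = apply_write(state, w)
    final_states.add(state)

print(quorum(5))          # 3
print(len(final_states))  # 1: every delivery order converges to the same state
```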

-- 
/ Peter Schuller

Re: What happens if there is a collision?

Posted by Jérôme Verstrynge <jv...@gmail.com>.

On 21/10/2010 4:43, Jonathan Ellis wrote:
> On Wed, Oct 20, 2010 at 8:38 PM, Jérôme Verstrynge<jv...@gmail.com>  wrote:
>> -) Let's imagine node A updates MyColumn with value 'AAA' and timestamp
>> 3567890
>> -) Let's imagine node E updates MyColumn with value 'EEE' and timestamp
>> 3567890
>>
>> What happens? Who wins? Is it deterministic?
> value breaks ties if timestamps are identical, so AAA would win.
OK. Thanks for your quick answer.

>> Let's imagine node A performs 3 writes before node E, is any node notified
>> of the collision?
> I don't understand the question.
If there is a timestamp tie, then 'AAA' wins. So node E's call to 
Cassandra's write method will return without being performed, because 
node A's 'AAA' won the timestamp tie.

My question is: is node E notified that it lost the battle against A? If 
yes how?

If not, then it means that, although writes are atomic, they would not 
be deterministic. Node E would have to verify that its write was 
successful...

Jérôme

Re: What happens if there is a collision?

Posted by Jonathan Ellis <jb...@gmail.com>.

On Wed, Oct 20, 2010 at 8:38 PM, Jérôme Verstrynge <jv...@gmail.com> wrote:
> -) Let's imagine node A updates MyColumn with value 'AAA' and timestamp
> 3567890
> -) Let's imagine node E updates MyColumn with value 'EEE' and timestamp
> 3567890
>
> What happens? Who wins? Is it deterministic?

value breaks ties if timestamps are identical, so AAA would win.
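
A minimal Python sketch of such a rule, for illustration only. `Column` and `reconcile` are hypothetical names, not Cassandra's API, and the direction of the value comparison (the larger byte string wins) is an assumption of this sketch rather than something confirmed here. The point is only that a fixed rule on (timestamp, value) gives the same winner whichever write is seen first.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one stored version of a column."""
    value: bytes
    timestamp: int


def reconcile(left: Column, right: Column) -> Column:
    """Pick the surviving version of a column.

    Higher timestamp wins; on equal timestamps the value decides.
    Assumption: the larger byte string wins the tie.
    """
    if left.timestamp != right.timestamp:
        return left if left.timestamp > right.timestamp else right
    return left if left.value >= right.value else right


a = Column(b"AAA", 3567890)
e = Column(b"EEE", 3567890)

# Same winner whichever write is seen first.
print(reconcile(a, e) == reconcile(e, a))
```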

> Let's imagine node A performs 3 writes before node E, is any node notified
> of the collision?

I don't understand the question.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com