
Posted to user@cassandra.apache.org by Anthony John <ch...@gmail.com> on 2011/02/24 03:22:15 UTC

New Chain for : Does Cassandra use vector clocks

Apologies : For some reason my response on the original mail keeps bouncing
back, thus this new one!
> From the other hand, the same article says:
> "For conditional writes to work, the condition must be evaluated at all
> update sites before the write can be allowed to succeed."
>
> This means, that when doing such an update CL=ALL must be used

Sorry, but I am confused by that entire thread!

Questions:-
1. Does Cassandra implement any kind of data locking - at any granularity
whether it be row/colF/Col ?
2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
Concurrent updates on exactly the same piece of data on different nodes can
still mess each other up, right ?

-JA

Re: New Chain for : Does Cassandra use vector clocks

Posted by Oleg Anastasyev <ol...@gmail.com>.

Sylvain Lebresne <sylvain <at> datastax.com> writes:

> However, if that simple conflict detection/resolution mechanism is not good
> enough for some of your use cases and you need to keep two concurrent updates,
> it is easy enough. Just make sure that the updates don't end up in the same
> column. This is easily achieved by appending some unique identifier to the
> column name, for instance. And when reading, do a slice and reconcile whatever
> you get back with whatever logic makes sense. If you do that, congrats, you've
> roughly emulated what vector clocks would do. Btw, no locking or anything
> needed.

This solution is (much?) worse than having vector clocks. It multiplies the
amount of data and the load on your system, forcing you to add more nodes to
the cluster, because:
* The number of columns at least doubles. It gets even worse if you cannot
predict the number of simultaneous processes accessing the same column, because
you then need to add a unique postfix to the column of each update, making them
effectively not updates but inserts. If you have a dataset that is updated
often, you'll multiply the number of columns, and so the data size, by the
number of updates to your dataset.
* These columns with unique postfixes need to be merged somehow. Cassandra has
a nice background merge facility, named compaction, but it cannot work on such
a dataset, because there is nothing to compact: every column is unique and has
no overwritten generation.
* So the merge must be done anyway, because logically this is still a single
column. The only way is to read all columns with some prefix using a get_slice
call and resolve conflicts manually, returning the freshest copy to the client
and deleting obsolete data. This makes the app code complex, triggers
additional load on the Cassandra cluster (it must now do read repair for
several columns instead of 1), and triggers additional operations (deletes of
obsolete values).
* And finally, deleting obsolete data doesn't actually free space until the GC
grace period (GCGraceSeconds) has elapsed. So your disks will fill up, storing
obsolete data for a prolonged time.

In contrast, having vector clocks is a more effective solution. It does not
duplicate column names and values several times; it duplicates only the
timestamp, by the number of your RF. And your logically single column is
handled as a single column.
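
To make the cost concrete, here is a minimal Python sketch of the
suffix-per-update pattern discussed above. It models a row as a plain dict
rather than using any real Cassandra client; the names `write_with_suffix` and
`read_and_reconcile`, the `base:writer_id` naming scheme, and the
newest-timestamp merge rule are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for one Cassandra row:
# column name -> (value, timestamp). Not a real client API.
Row = Dict[str, Tuple[bytes, int]]


def write_with_suffix(row: Row, base: str, writer_id: str, value: bytes, ts: int) -> None:
    """Store each update under its own column so concurrent writes never collide."""
    row[f"{base}:{writer_id}"] = (value, ts)


def read_and_reconcile(row: Row, base: str) -> Tuple[bytes, List[str]]:
    """Slice every column sharing the prefix, keep the freshest, list the rest as obsolete."""
    prefix = f"{base}:"
    candidates = {name: col for name, col in row.items() if name.startswith(prefix)}
    if not candidates:
        raise KeyError(base)
    # Application-defined merge; here simply "newest timestamp wins".
    winner = max(candidates, key=lambda name: candidates[name][1])
    obsolete = sorted(name for name in candidates if name != winner)
    return candidates[winner][0], obsolete


row: Row = {}
write_with_suffix(row, "picture", "client-a", b"cat.png", ts=100)
write_with_suffix(row, "picture", "client-b", b"dog.png", ts=101)
value, obsolete = read_and_reconcile(row, "picture")
```

After two writers, the row holds two physical columns for one logical value,
and the reader is left with `picture:client-a` to delete, which is the extra
work described in the list above.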

Re: New Chain for : Does Cassandra use vector clocks

Posted by Dave Revell <da...@meebo-inc.com>.

> Time stamps are not used for conflict resolution - unless it is part of the
> application logic!!!

This is false. In fact, the main reason Cassandra keeps timestamps is to do
conflict resolution. If there is a conflict between two replicas, when doing
a read or a repair, then the highest timestamp always wins.

Example: say your replication factor is 5. If you read at CL ALL, you
will ask 5 replicas for their value. If the value from only one of these
replicas has a timestamp that is newer than all the rest, this is the value
that will be returned to the client. There is no "voting" scheme where the
most common value wins; the conflict resolution is based ONLY on the most
recent timestamp.
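
The rule can be sketched in a few lines of Python. This is a simplified model,
not Cassandra's code: replica replies are assumed to be `(value, timestamp)`
tuples, `resolve` is a hypothetical name, and the tie-break on equal timestamps
follows the byte-comparison rule Sylvain describes in the message quoted below.

```python
from typing import List, Tuple

# (value, timestamp) as reported by one replica; a simplified model,
# not Cassandra's internal types.
Column = Tuple[bytes, int]


def resolve(replies: List[Column]) -> Column:
    """Pick the winner among replica replies: highest timestamp, then highest value bytes."""
    # The key orders by timestamp first and falls back to a byte-wise
    # comparison of the value only when timestamps are equal.
    return max(replies, key=lambda col: (col[1], col[0]))


# RF=5, read at CL ALL: four replicas hold the old value, one holds a newer one.
replies = [(b"old", 10)] * 4 + [(b"new", 11)]
winner = resolve(replies)
```

Here the single newer reply wins over four identical older ones, so no majority
vote is involved.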

(irrelevant aside: in the above example, read repair would occur at the end,
after the different values were detected by the coordinating server)

Clients are free to use the timestamps for their own purposes, but clients
must be careful to choose timestamps that make Cassandra do the right thing
during conflict resolution.

Best,
Dave

On Thu, Feb 24, 2011 at 8:34 AM, Anthony John <ch...@gmail.com> wrote:

> >>Time stamps are not used for conflict resolution - unless is is part of
>> the application logic!!!
>>
>
> >>What is you definition of conflict resolution ? Because if you update
> twice the same column (which
> >>I'll call a conflict), then the timestamps are used to decide which
> update wins (which I'll call a resolution).
>
> I understand what you are saying, and yes semantics is very important here.
> And yes we are responding to the immediate questions without covering all
> questions in the thread.
>
> The point being made here is that the timestamp of the column is not used
> by Cassandra to figure out what data to return.
>
> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> A Quorum  Write comes and add/updates the time stamp (TS2) of a particular
> data element. It succeeds on N1 - fails on N2/3. So the write is returned as
> failed - right ?
> Now Quorum read comes in for exactly the same piece of data that the write
> failed for.
> So N1 has TS2 but both N2/3 have the old TS (say TS1)
> And the read succeeds - Will it return TS1 or TS2.
>
> I submit it will return TS1 - the old TS.
>
> Are we on the same page with this interpretation ?
>
> Regards,
>
> -JA
>
> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>
>>> Sylvan,
>>>
>>> Time stamps are not used for conflict resolution - unless is is part of
>>> the application logic!!!
>>>
>>
>> What is you definition of conflict resolution ? Because if you update
>> twice the same column (which
>> I'll call a conflict), then the timestamps are used to decide which update
>> wins (which I'll call a resolution).
>>
>>
>>> You can have "lost updates" w/Cassandra. You need to to use 3rd products
>>> - cages for e.g. - to get ACID type consistency.
>>>
>>
>> Then again, you'll have to define what you are calling "lost updates".
>> Provided you use a reasonable consistency level, Cassandra provides fairly
>> strong durability guarantee, so for some definition you don't "lose
>> updates".
>>
>> That being said, I never pretended that Cassandra provided any ACID
>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>> we're talking about the guarantees of transaction, then by all means,
>> cassandra won't provide it. And yes you can use cages or the like to get
>> transaction. But that was not the point of the thread, was it ? The thread
>> is about vector clocks, and that has nothing to do with transaction (vector
>> clocks certainly don't give you transactions).
>>
>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>> far I don't think vector clocks would really provide much for Cassandra.
>>
>> --
>> Sylvain
>>
>>
>>> -JA
>>>
>>>
>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com>wrote:
>>>>
>>>>> Apologies : For some reason my response on the original mail keeps
>>>>> bouncing back, thus this new one!
>>>>> > From the other hand, the same article says:
>>>>> > "For conditional writes to work, the condition must be evaluated at
>>>>> all update
>>>>> > sites before the write can be allowed to succeed."
>>>>> >
>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>
>>>>> Sorry, but I am confused by that entire thread!
>>>>>
>>>>> Questions:-
>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> granularity whether it be row/colF/Col ?
>>>>>
>>>>
>>>> No locking, no.
>>>>
>>>>
>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
>>>>> Concurrent updates on exactly the same piece of data on different nodes can
>>>>> still mess each other up, right ?
>>>>>
>>>>
>>>> Not sure why you are talking about CL.ALL specifically. But at any CL,
>>>> updating the same piece of data means updating the same column value. In
>>>> that case, the resolution rules are the following:
>>>>   - If the updates have different timestamps, keep the one with the
>>>>     higher timestamp. That is, the more recent of the two updates wins.
>>>>   - If the timestamps are the same, compare the values (byte
>>>>     comparison) and keep the highest value. This is just to break ties
>>>>     in a consistent manner.
>>>>
>>>> So if you do two truly concurrent updates (that is, from two places at the
>>>> same instant), then you'll end up with one of the updates. This is at the
>>>> column level.
>>>>
>>>> However, if that simple conflict detection/resolution mechanism is not
>>>> good enough for some of your use cases and you need to keep two concurrent
>>>> updates, it is easy enough. Just make sure that the updates don't end up in
>>>> the same column. This is easily achieved by appending some unique identifier
>>>> to the column name, for instance. And when reading, do a slice and reconcile
>>>> whatever you get back with whatever logic makes sense. If you do that,
>>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>> locking or anything needed.
>>>>
>>>> In my experience, for most things the timestamp resolution is enough. If
>>>> the same user updates their profile picture twice on your web site in the
>>>> same microsecond, it's usually fine to end up with one of the two pictures.
>>>> In the rare case where you need something more specific, using the Cassandra
>>>> data model usually solves the problem easily. The reason for not having
>>>> vector clocks in Cassandra is that so far, we haven't really found many
>>>> examples where this is not the case.
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>>
>>>
>>
>
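
The quorum example quoted above (RF=3, a write applied on N1 only) can be
worked through with a small Python sketch. It assumes the coordinator returns
the highest-timestamp reply among whichever two replicas answer first, as Dave
describes; the `replicas` dict and `quorum_read` are hypothetical stand-ins,
not Cassandra APIs.

```python
from itertools import combinations
from typing import Dict, Tuple

# Replica name -> (value, timestamp) after the failed quorum write:
# only N1 applied the new timestamp TS2; N2 and N3 still hold TS1.
TS1, TS2 = 1, 2
replicas: Dict[str, Tuple[str, int]] = {
    "N1": ("new", TS2),
    "N2": ("old", TS1),
    "N3": ("old", TS1),
}


def quorum_read(responders: Tuple[str, ...]) -> Tuple[str, int]:
    """Return the reply with the highest timestamp among the replicas that answered."""
    return max((replicas[name] for name in responders), key=lambda col: col[1])


# Every pair of replicas that could form the read quorum (2 of 3).
outcomes = {pair: quorum_read(pair) for pair in combinations(sorted(replicas), 2)}
for pair, (value, ts) in outcomes.items():
    print(pair, "->", value, f"(TS{ts})")
```

Under that assumption the read is not guaranteed to return TS1: any quorum that
includes N1 returns the TS2 value, and only the N2/N3 pair returns the old one.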

Re: New Chain for : Does Cassandra use vector clocks

Posted by Jeremy Hanna <je...@gmail.com>.

Yeah - no worries - I don't think anyone was thinking you were trying to drink kool-aid or selling anything.  Jonathan was just pointing out thoughtful replies to his claims.

This past year, Michael Stonebraker with voltdb and other things seems to have tried to take advantage of momentum behind systems like cassandra (as well as the backlash against nosql) to make pretty bold claims, especially when considering that volt is an in memory database.  So 1) he's kind of been using his pedigree as credibility in selling a new product and 2) the voltdb marketing department makes heavy use of buzz words and hyperbole.

Nothing wrong with voltdb necessarily, it probably has its uses.  However, the way it's been pitched by the company and by Stonebraker in particular seems disingenuous, self-serving, and to me has very much tarnished his reputation as an objective luminary in the field of computer science.

Maybe I'm taking that too far, but now every time I hear a statement by him, I have a grain of salt at the ready.

On Feb 25, 2011, at 10:21 AM, A J wrote:

> Though you are not really implying that, I am not selling anything. I
> don't work for VoltDB. I had other issues for my use case with the
> software when I was evaluating it (their claim of durability is weak
> according to me. Though it does not matter I'd rather they call
> themselves NOSQL. they just give lip-service to SQL)
> I'd rather not drink any sort of kool-aid, get all sides (whatever the
> motive of the sides be) and be the judge myself for what I want to do.
> 
> The thread was by someone who seems to be having difficulty wrapping
> head around the gives and takes of cassandra. maybe something else is
> better for their use case.
> 
> Peace :)
> 
> 
> On Fri, Feb 25, 2011 at 10:39 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>> That article is heavily biased by "I am selling a competitor to Cassandra."
>> 
>> First, read Coda's original piece if you haven't:
>> http://codahale.com/you-cant-sacrifice-partition-tolerance/
>> 
>> Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110
>> 
>> On Thu, Feb 24, 2011 at 2:56 PM, A J <s5...@gmail.com> wrote:
>>> While we are at it, there's more to consider than just CAP in distributed :)
>>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>> 
>>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>>>> yes, that is difficult to digest and one has to be sure if the use
>>>>> case can afford it.
>>>>> 
>>>>> Some other NOSQL databases deal with it differently (though I don't
>>>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>>>> ask you to read from the node you wrote to first (primary node) unless
>>>>> you are ok with eventual consistency. If the write did not make it to a
>>>>> majority of the other nodes, it will be rolled back from the original
>>>>> primary when it comes up again as a secondary.
>>>>> In some cases, you could still serve either the new value (that was
>>>>> returned as failed) or the old one. But it is different from Cassandra
>>>>> in the sense that Cassandra will never roll back.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>>> prior state - as we are used to with databases. It means that the operation
>>>>>> in error could have gone through partially
>>>>>> 
>>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>>>> -JA
>>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>>> 
>>>>>>>>> but could be broken in case of a failed write<<
>>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>>> Let's say the one node where a write succeeded goes down before the
>>>>>>> write made it to the other N-1 nodes. Let's say it goes down for good
>>>>>>> and is unrecoverable. The only option is to build a new node from
>>>>>>> scratch from the other active nodes. This will lead to a write that was
>>>>>>> lost, and you will end up serving a stale copy of the data.
>>>>>>>
>>>>>>> It is better to talk in terms of use cases and whether Cassandra will be
>>>>>>> a fit for them. Otherwise, unless you have W=R=N and fsync before each
>>>>>>> write commit, there will be scope for inconsistency.
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>>>> wrote:
>>>>>>>> I see the point - apologies for putting everyone through this!
>>>>>>>> It was just militating against my mental model.
>>>>>>>> In summary, here is my take away - simple stuff but - IMO - important to
>>>>>>>> conclude this thread (I hope):-
>>>>>>>> 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>>>>>>>> should be immediately followed by the same write going to a connection
>>>>>>>> on to
>>>>>>>> another node ( potentially using connection caches of client
>>>>>>>> implementations
>>>>>>>> ) or a Read at CL of All. Because a write could have partially gone
>>>>>>>> through.
>>>>>>>> 2. Timestamps are used in determining the latest version ( correcting
>>>>>>>> the
>>>>>>>> false impression I was propagating)
>>>>>>>> Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>>>>>>> in
>>>>>>>> case of a failed write as it is unsure whether the new value got written
>>>>>>>> on
>>>>>>>>  any server or not. Is that a fair characterization ?
>>>>>>>> Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>>>>> cleanup and revert back, app code has to follow up if  immediate - and
>>>>>>>> not
>>>>>>>> eventual -  consistency is desired. I made that leap in almost all cases
>>>>>>>> - I
>>>>>>>> think - but the case of a failed write.
>>>>>>>> My bad and I can live with this!
>>>>>>>> Regards,
>>>>>>>> -JA
>>>>>>>> 
>>>>>>>> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>>>> <sy...@datastax.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Completely understand!
>>>>>>>>>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>>>>>> consistency or not. That is what the documentation says - right. IF
>>>>>>>>>> for a CL
>>>>>>>>>> of Q read - it depends on which node returns read first to determine
>>>>>>>>>> the
>>>>>>>>>> actual returned result or other more convoluted conditions , then a
>>>>>>>>>> Quorum
>>>>>>>>>> read/write is not consistent, by any definition.
>>>>>>>>> 
>>>>>>>>> But that's the point. The definition of consistency we are talking
>>>>>>>>> about has no meaning if you consider only a quorum read. The
>>>>>>>>> definition (which is the de facto definition of consistency in
>>>>>>>>> 'eventually consistent') makes sense if we talk about a write followed
>>>>>>>>> by a read. And it considers a succeeding write followed by a
>>>>>>>>> succeeding read.
>>>>>>>>> And that is the statement the wiki is making.
>>>>>>>>> Honestly, we could debate forever on the definition of consistency and
>>>>>>>>> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>>>>>>>>> replicas and then a (succeeding) read on R replicas, and if R+W>N, then
>>>>>>>>> the read is guaranteed to see the preceding write. And this is what is
>>>>>>>>> called consistency in the context of eventual consistency (which is
>>>>>>>>> not the context of ACID).
>>>>>>>>> If this is not the definition of consistency you had in mind, then by
>>>>>>>>> all means, Cassandra probably doesn't guarantee that definition. But
>>>>>>>>> given that the paragraph preceding what you pasted states clearly that
>>>>>>>>> we are not talking about ACID consistency but eventual consistency, I
>>>>>>>>> don't think the wiki is making any unfair statement.
>>>>>>>>> That being said, the wiki may not always be as clear as it could be.
>>>>>>>>> But it's an editable wiki :)
>>>>>>>>> --
>>>>>>>>> Sylvain
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>>>>>> make
>>>>>>>>>> this statement on the Wiki architecture section:-
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> More specifically: R=read replica count W=write replica
>>>>>>>>>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>>>>>> 
>>>>>>>>>> If W + R > N, you will have consistency
>>>>>>>>>> 
>>>>>>>>>> W=1, R=N
>>>>>>>>>> W=N, R=1
>>>>>>>>>> W=Q, R=Q where Q = N / 2 + 1
>>>>>>>>>> 
>>>>>>>>>> Cassandra provides consistency when R + W > N (read replica count
>>>>>>>>>> + write
>>>>>>>>>> replica count > replication factor).
>>>>>>>>>> 
>>>>>>>>>> ----------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> .
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>>>>>> <sy...@datastax.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> If you are correct and you are probably closer to the code - then CL
>>>>>>>>>>>> of
>>>>>>>>>>>> Quorum does not guarantee a consistency.
>>>>>>>>>>> 
>>>>>>>>>>> If the operation succeeds, it does (for some definition of
>>>>>>>>>>> consistency, namely that subsequent reads at Quorum are guaranteed to
>>>>>>>>>>> see the new value of an update at Quorum). If it fails, then no, it
>>>>>>>>>>> does not guarantee consistency.
>>>>>>>>>>> It is important to note that the word consistency has multiple
>>>>>>>>>>> meanings. In particular, when we are talking of consistency in
>>>>>>>>>>> Cassandra, we are not talking of the same definition as the C in ACID
>>>>>>>>>>> 
>>>>>>>>>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>>>>>>>>> <ch...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>>>>>>>>>>>> update twice the same column (which
>>>>>>>>>>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>>>>>>> I understand what you are saying, and yes semantics is very
>>>>>>>>>>>>>> important
>>>>>>>>>>>>>> here. And yes we are responding to the immediate questions without
>>>>>>>>>>>>>> covering
>>>>>>>>>>>>>> all questions in the thread.
>>>>>>>>>>>>>> The point being made here is that the timestamp of the column is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Not quite true.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>>>>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>>>>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>>>>>>>>>> write is
>>>>>>>>>>>>>> returned as failed - right ?
>>>>>>>>>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> write failed for.
>>>>>>>>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>>>>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>>>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>>>>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is one of the two that
>>>>>>>>>>>>> make up the quorum, then TS2 will be returned, because Cassandra
>>>>>>>>>>>>> will compare the timestamps and decide what to return based on
>>>>>>>>>>>>> them. If N2/N3 respond, however, both timestamps will be TS1 and
>>>>>>>>>>>>> so, after timestamp resolution, it will still be TS1 that is
>>>>>>>>>>>>> returned.
>>>>>>>>>>>>> So yes, the timestamp is used for conflict resolution.
>>>>>>>>>>>>> In your example, you could get TS1 back because a failed write can
>>>>>>>>>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>>>>>>>>>>> quorum write, and only when it succeeds are you guaranteed that a
>>>>>>>>>>>>> quorum read will always return TS2.
>>>>>>>>>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>>>>>>>>>> that the write did not make it in (there is no revert).
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Are we on the same page with this interpretation ?
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sylvan,
>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>>>>>>>>>>> update twice the same column (which
>>>>>>>>>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>>>>>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>>>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>>>>>>>>>>>> Cassandra
>>>>>>>>>>>>>>> provides fairly strong durability guarantee, so for some
>>>>>>>>>>>>>>> definition you
>>>>>>>>>>>>>>> don't "lose updates".
>>>>>>>>>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>>>>>>>>>>>> ACID
>>>>>>>>>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't
>>>>>>>>>>>>>>> support. If
>>>>>>>>>>>>>>> we're talking about the guarantees of transaction, then by all
>>>>>>>>>>>>>>> means,
>>>>>>>>>>>>>>> cassandra won't provide it. And yes you can use cages or the like
>>>>>>>>>>>>>>> to get
>>>>>>>>>>>>>>> transaction. But that was not the point of the thread, was it ?
>>>>>>>>>>>>>>> The thread
>>>>>>>>>>>>>>> is about vector clocks, and that has nothing to do with
>>>>>>>>>>>>>>> transaction (vector
>>>>>>>>>>>>>>> clocks certainly don't give you transactions).
>>>>>>>>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>>>>>>>>>>>> why
>>>>>>>>>>>>>>> so far I don't think vector clocks would really provide much for
>>>>>>>>>>>>>>> Cassandra.
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> From the other hand, the same article says:
>>>>>>>>>>>>>>>>>>> "For conditional writes to work, the condition must be
>>>>>>>>>>>>>>>>>>> evaluated at all update
>>>>>>>>>>>>>>>>>>> sites before the write can be allowed to succeed."
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> This means, that when doing such an update CL=ALL must be
>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>>>>>>>> Questions:-
>>>>>>>>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> No locking, no.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>>>>>>>>>>>>>>> data on different
>>>>>>>>>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>>>>>>>>>> updating the same piece of data means the same column value. In
>>>>>>>>>>>>>>>>> that case,
>>>>>>>>>>>>>>>>> the resolution rules are the following:
>>>>>>>>>>>>>>>>>   - If the updates have a different timestamp, keep the one
>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>>>>>>>>>>>>>> win.
>>>>>>>>>>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>>>>>>>>>>>>>> (byte comparison) and keep the highest value. This is just to
>>>>>>>>>>>>>>>>> break ties in
>>>>>>>>>>>>>>>>> a consistent manner.
>>>>>>>>>>>>>>>>> So if you do two truly concurrent updates (that is from two
>>>>>>>>>>>>>>>>> place
>>>>>>>>>>>>>>>>> at the same instant), then you'll end with one of the update.
>>>>>>>>>>>>>>>>> This is the
>>>>>>>>>>>>>>>>> column level.
>>>>>>>>>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>> not good enough for some of your use case and you need to keep
>>>>>>>>>>>>>>>>> two
>>>>>>>>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>>>>>>>>>>>>>> update don't
>>>>>>>>>>>>>>>>> end up in the same column. This is easily achieved by appending
>>>>>>>>>>>>>>>>> some unique
>>>>>>>>>>>>>>>>> identifier to the column name for instance. And when reading,
>>>>>>>>>>>>>>>>> do a slice and
>>>>>>>>>>>>>>>>> reconcile whatever you get back with whatever logic make sense.
>>>>>>>>>>>>>>>>> If you do
>>>>>>>>>>>>>>>>> that, congrats, you've roughly emulated what vector clocks
>>>>>>>>>>>>>>>>> would do. Btw, no
>>>>>>>>>>>>>>>>> locking or anything needed.
>>>>>>>>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>>>>>>>>> enough. If the same user update twice it's profile picture on
>>>>>>>>>>>>>>>>> you web site
>>>>>>>>>>>>>>>>> at the same microsecond, it's usually fine to end up with one
>>>>>>>>>>>>>>>>> of the two
>>>>>>>>>>>>>>>>> pictures. In the rare case where you need something more
>>>>>>>>>>>>>>>>> specific, using the
>>>>>>>>>>>>>>>>> cassandra data model usually solves the problem easily. The
>>>>>>>>>>>>>>>>> reason for not
>>>>>>>>>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't
>>>>>>>>>>>>>>>>> really found
>>>>>>>>>>>>>>>>> much example where it is no the case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> Just to make a note, the "EVENTUAL" in eventual consistency could be a
>>>> time that is less than 1ms.
>>>> 
>>>> I have a program that demonstrates what "eventual" means: if I write
>>>> data at the weakest level and read it back from another random node
>>>> as soon as possible, 99% of the time I see the update. I can share the
>>>> code if you would like.
>>>> 
>>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>>> ...but there is no reference frame in which the two events can occur
>>>> at the same time...
>>>> 
>>>> As to MongoDB references ....Yes! Most of the NoSQL stores work differently.
>>>> They each approach CAP
>>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>>> different way.
>>>> 
>>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>>> have it all: pick 2 of 3 from CAP.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
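
The R + W > N condition from the wiki excerpt quoted above can be checked by
brute force. The Python sketch below assumes replicas are just numbered
0..N-1 and that a successful write or read touches exactly W or R of them;
`quorum` and `always_overlap` are hypothetical helper names.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1 with integer division, as in the wiki excerpt quoted above."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-replica write set shares a replica with every R-replica read set."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


n = 3
q = quorum(n)
results = {(w, r): always_overlap(n, w, r) for w in range(1, n + 1) for r in range(1, n + 1)}
```

For N=3 the overlap holds exactly when W + R > N, which covers W=1/R=N,
W=N/R=1 and W=Q/R=Q. It says nothing about failed writes or permanently lost
nodes, which are the cases the thread argues about.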

Re: New Chain for : Does Cassandra use vector clocks

Posted by Jeremy Hanna <je...@gmail.com>.

And everyone has a bias - and I think most people working with any of these solutions realize that.

I think it's interesting how many organizations use multiple data storage solutions versus just using one as they have different capabilities - like the recent Netflix news about using different data stores for different reasons.

On Feb 25, 2011, at 10:21 AM, A J wrote:

> Though you are not really implying that, I am not selling anything. I
> don't work for VoltDB. I had other issues for my use case with the
> software when I was evaluating it (their claim of durability is weak
> according to me. Though it does not matter I'd rather they call
> themselves NOSQL. they just give lip-service to SQL)
> I'd rather not drink any sort of kool-aid, get all sides (whatever the
> motive of the sides be) and be the judge myself for what I want to do.
> 
> The thread was by someone who seems to be having difficulty wrapping
> head around the gives and takes of cassandra. maybe something else is
> better for their use case.
> 
> Peace :)
> 
> 
> On Fri, Feb 25, 2011 at 10:39 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>> That article is heavily biased by "I am selling a competitor to Cassandra."
>> 
>> First, read Coda's original piece if you haven't:
>> http://codahale.com/you-cant-sacrifice-partition-tolerance/
>> 
>> Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110
>> 
>> On Thu, Feb 24, 2011 at 2:56 PM, A J <s5...@gmail.com> wrote:
>>> While we are at it, there's more to consider than just CAP in distributed :)
>>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>> 
>>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>>>> Yes, that is difficult to digest, and one has to be sure that the use
>>>>> case can afford it.
>>>>> 
>>>>> Some other NoSQL databases deal with it differently (though I don't
>>>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>>>> ask you to read from the node you wrote to first (the primary node) unless
>>>>> you are OK with eventual consistency. If the write did not make it to a
>>>>> majority of the other nodes, it will be rolled back from the original
>>>>> primary when it comes up again as a secondary.
>>>>> In some cases, you could still serve either the new value (that was
>>>>> returned as failed) or the old one. But it is different from Cassandra
>>>>> in the sense that Cassandra will never roll back.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>>> prior state - as we are used to with databases. It means that the operation
>>>>>> in error could have gone through partially
>>>>>> 
>>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>>>> -JA
>>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>>> 
>>>>>>> "but could be broken in case of a failed write"
>>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>>> Let's say the one node where the write succeeded goes down before
>>>>>>> the write made it to the other N-1 nodes. Let's say it goes down for
>>>>>>> good and is unrecoverable. The only option is to build a new node from
>>>>>>> scratch from the other active nodes. The write is then lost, and you
>>>>>>> will end up serving a stale copy of the data.
>>>>>>> 
>>>>>>> It is better to talk in terms of use cases and whether Cassandra will
>>>>>>> be a fit for them. Otherwise, unless you have W=R=N and fsync before
>>>>>>> each write commit, there will be scope for inconsistency.
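
The W=1, R=N scenario described above can be traced with a toy model. This is only a sketch under stated assumptions: each replica is a plain Python dict holding a hypothetical `(value, timestamp)` pair, and none of the names below come from Cassandra itself.

```python
# Three replicas, each holding the old version ("v1", timestamp 1) of key "k".
replicas = [{"k": ("v1", 1)} for _ in range(3)]

# A write at W=1 is acknowledged as soon as one replica (node 0) has it.
replicas[0]["k"] = ("v2", 2)

# Node 0 is lost for good before replicating; it is rebuilt from a survivor.
replicas[0] = dict(replicas[1])

# A read at R=N asks every replica and keeps the highest timestamp.
value, ts = max((r["k"] for r in replicas), key=lambda vt: vt[1])
print(value, ts)  # the acknowledged write "v2" is no longer visible
```

The write was reported as successful, yet no surviving replica holds it, which is the inconsistency the message points out.
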
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>>>> wrote:
>>>>>>>> I see the point - apologies for putting everyone through this!
>>>>>>>> It was just militating against my mental model.
>>>>>>>> In summary, here is my takeaway - simple stuff but, IMO, important to
>>>>>>>> conclude this thread (I hope):
>>>>>>>> 1. I was splitting hairs over a failed (partial) Quorum write. Such an
>>>>>>>> event should be immediately followed by the same write going over a
>>>>>>>> connection to another node (potentially using the connection caches of
>>>>>>>> client implementations) or by a read at CL ALL, because the write could
>>>>>>>> have partially gone through.
>>>>>>>> 2. Timestamps are used in determining the latest version (correcting the
>>>>>>>> false impression I was propagating).
>>>>>>>> Finally, the "W + R > N" statement for Quorum CL holds, but could be
>>>>>>>> broken in case of a failed write, as it is uncertain whether the new
>>>>>>>> value got written on any server or not. Is that a fair characterization?
>>>>>>>> Bottom line - unlike with a traditional DBMS, errors do not ensure
>>>>>>>> automatic cleanup and rollback; app code has to follow up if immediate,
>>>>>>>> and not eventual, consistency is desired. I made that leap in almost all
>>>>>>>> cases, I think, except the case of a failed write.
>>>>>>>> My bad and I can live with this!
>>>>>>>> Regards,
>>>>>>>> -JA
>>>>>>>> 
>>>>>>>> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>>>> <sy...@datastax.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Completely understand!
>>>>>>>>>> All that I am quibbling over is whether a CL of Quorum guarantees
>>>>>>>>>> consistency or not. That is what the documentation says, right? If, for
>>>>>>>>>> a read at CL Quorum, the actual returned result depends on which node
>>>>>>>>>> responds first, or on other more convoluted conditions, then a Quorum
>>>>>>>>>> read/write is not consistent, by any definition.
>>>>>>>>> 
>>>>>>>>> But that's the point. The definition of consistency we are talking about
>>>>>>>>> has no meaning if you consider only a quorum read. The definition (which
>>>>>>>>> is the de facto definition of consistency in 'eventually consistent')
>>>>>>>>> makes sense if we talk about a write followed by a read, and it
>>>>>>>>> considers a succeeding write followed by a succeeding read.
>>>>>>>>> And that is the statement the wiki is making.
>>>>>>>>> Honestly, we could debate forever on the definition of consistency and
>>>>>>>>> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>>>>>>>>> replicas and then a (succeeding) read on R replicas, and if R+W>N, then
>>>>>>>>> the read is guaranteed to see the preceding write. And this is what is
>>>>>>>>> called consistency in the context of eventual consistency (which is not
>>>>>>>>> the context of ACID).
>>>>>>>>> If this is not the definition of consistency you had in mind then, by all
>>>>>>>>> means, Cassandra probably doesn't guarantee that definition. But given
>>>>>>>>> that the paragraph preceding what you pasted states clearly that we are
>>>>>>>>> not talking about ACID consistency, but eventual consistency, I don't
>>>>>>>>> think the wiki is making any unfair statement.
>>>>>>>>> That being said, the wiki may not always be as clear as it could be. But
>>>>>>>>> it's an editable wiki :)
>>>>>>>>> --
>>>>>>>>> Sylvain
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>>>>>> make
>>>>>>>>>> this statement on the Wiki architecture section:-
>>>>>>>>>> -------------------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> More specifically:
>>>>>>>>>> R = read replica count
>>>>>>>>>> W = write replica count
>>>>>>>>>> N = replication factor
>>>>>>>>>> Q = QUORUM (Q = N / 2 + 1)
>>>>>>>>>> 
>>>>>>>>>> If W + R > N, you will have consistency
>>>>>>>>>> 
>>>>>>>>>> W=1, R=N
>>>>>>>>>> W=N, R=1
>>>>>>>>>> W=Q, R=Q where Q = N / 2 + 1
>>>>>>>>>> 
>>>>>>>>>> Cassandra provides consistency when R + W > N (read replica count + write
>>>>>>>>>> replica count > replication factor).
>>>>>>>>>> 
>>>>>>>>>> ----------------------------------------------------
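
The W + R > N rule quoted from the wiki is a pigeonhole argument: a set of W replicas and a set of R replicas out of N must share at least one node exactly when W + R > N. The sketch below checks this by brute force; `quorum` and `always_overlap` are hypothetical helper names introduced here, and the division in Q = N / 2 + 1 is assumed to be integer division.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares a node with every set of r read replicas."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


n = 3
q = quorum(n)
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r}: W+R>N is {w + r > n}, overlap is {always_overlap(n, w, r)}")
```

The overlap only says that a read touches some replica that took a successful write; it says nothing about writes that were reported as failed.
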
>>>>>>>>>> 
>>>>>>>>>> .
>>>>>>>>>> 
>>>>>>>>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>>>>>> <sy...@datastax.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> If you are correct - and you are probably closer to the code - then
>>>>>>>>>>>> CL of Quorum does not guarantee consistency.
>>>>>>>>>>> 
>>>>>>>>>>> If the operation succeeds, it does (for some definition of consistency,
>>>>>>>>>>> namely that subsequent reads at Quorum are guaranteed to see the new
>>>>>>>>>>> value of an update at Quorum). If it fails, then no, it does not
>>>>>>>>>>> guarantee consistency.
>>>>>>>>>>> It is important to note that the word consistency has multiple meanings.
>>>>>>>>>>> In particular, when we are talking of consistency in Cassandra, we are
>>>>>>>>>>> not talking of the same definition as the C in ACID
>>>>>>>>>>> 
>>>>>>>>>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>>>>>>>>> <ch...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>>>>>>>>>>>> update twice the same column (which
>>>>>>>>>>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>>>>>>> I understand what you are saying, and yes semantics is very
>>>>>>>>>>>>>> important
>>>>>>>>>>>>>> here. And yes we are responding to the immediate questions without
>>>>>>>>>>>>>> covering
>>>>>>>>>>>>>> all questions in the thread.
>>>>>>>>>>>>>> The point being made here is that the timestamp of the column is
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Not quite true.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>>>>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>>>>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>>>>>>>>>> write is
>>>>>>>>>>>>>> returned as failed - right ?
>>>>>>>>>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>>>>>>>>>>> the
>>>>>>>>>>>>>> write failed for.
>>>>>>>>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>>>>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>>>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>>>>>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>>>>>>>>>> make up the quorum, then TS2 will be returned, because Cassandra will
>>>>>>>>>>>>> compare the timestamps and decide what to return based on them. If
>>>>>>>>>>>>> N2/N3 respond, however, both timestamps will be TS1 and so, after
>>>>>>>>>>>>> timestamp resolution, it will still be TS1 that is returned.
>>>>>>>>>>>>> So yes, timestamps are used for conflict resolution.
>>>>>>>>>>>>> In your example, you could get TS1 back because a failed write can
>>>>>>>>>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>>>>>>>>>>> quorum write, and only when it succeeds are you guaranteed that a
>>>>>>>>>>>>> quorum read will always return TS2.
>>>>>>>>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>>>>>>>>> the write did not make it in (there is no revert).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Are we on the same page with this interpretation ?
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Sylvain,
>>>>>>>>>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>>>>>>>>>>>> part of the application logic!!!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> What is your definition of conflict resolution? Because if you
>>>>>>>>>>>>>>> update the same column twice (which I'll call a conflict), then the
>>>>>>>>>>>>>>> timestamps are used to decide which update wins (which I'll call a
>>>>>>>>>>>>>>> resolution).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> You can have "lost updates" w/Cassandra. You need to use third-party
>>>>>>>>>>>>>>>> products - Cages, for example - to get ACID-type consistency.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>>>>>>>>> provides fairly strong durability guarantees, so for some definition
>>>>>>>>>>>>>>> you don't "lose updates".
>>>>>>>>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>>>>>>>>> guarantees. ACID relates to transactions, which Cassandra doesn't
>>>>>>>>>>>>>>> support. If we're talking about the guarantees of transactions, then
>>>>>>>>>>>>>>> by all means, Cassandra won't provide them. And yes, you can use Cages
>>>>>>>>>>>>>>> or the like to get transactions. But that was not the point of the
>>>>>>>>>>>>>>> thread, was it? The thread is about vector clocks, and that has
>>>>>>>>>>>>>>> nothing to do with transactions (vector clocks certainly don't give
>>>>>>>>>>>>>>> you transactions).
>>>>>>>>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why,
>>>>>>>>>>>>>>> so far, I don't think vector clocks would really provide much for
>>>>>>>>>>>>>>> Cassandra.
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -JA
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> From the other hand, the same article says:
>>>>>>>>>>>>>>>>>>> "For conditional writes to work, the condition must be
>>>>>>>>>>>>>>>>>>> evaluated at all update
>>>>>>>>>>>>>>>>>>> sites before the write can be allowed to succeed."
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> This means, that when doing such an update CL=ALL must be
>>>>>>>>>>>>>>>>>>> used
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>>>>>>>> Questions:-
>>>>>>>>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> No locking, no.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>>>>>>>>>>>>>>> data on different
>>>>>>>>>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Not sure why you are singling out CL.ALL specifically. But at any CL,
>>>>>>>>>>>>>>>>> updating the same piece of data means updating the same column value.
>>>>>>>>>>>>>>>>> In that case, the resolution rules are the following:
>>>>>>>>>>>>>>>>>   - If the updates have different timestamps, keep the one with the
>>>>>>>>>>>>>>>>>     higher timestamp. That is, the more recent of the two updates wins.
>>>>>>>>>>>>>>>>>   - If the timestamps are the same, compare the values (byte
>>>>>>>>>>>>>>>>>     comparison) and keep the highest value. This is just to break ties
>>>>>>>>>>>>>>>>>     in a consistent manner.
>>>>>>>>>>>>>>>>> So if you do two truly concurrent updates (that is, from two places
>>>>>>>>>>>>>>>>> at the same instant), then you'll end up with one of the updates. This
>>>>>>>>>>>>>>>>> resolution happens at the column level.
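
The two resolution rules quoted above can be sketched as follows. This is a minimal illustration, not Cassandra's actual code: the `Column` tuple and the `reconcile` function are hypothetical names introduced here, with timestamps modelled as integers and values as bytes.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one version of a column (not a Cassandra type)."""
    name: str
    value: bytes
    timestamp: int  # e.g. microseconds since the epoch, supplied by the client


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    if a.timestamp != b.timestamp:
        # Rule 1: the update with the higher timestamp wins.
        return a if a.timestamp > b.timestamp else b
    # Rule 2: on a timestamp tie, keep the higher value (byte comparison),
    # so that every replica breaks the tie the same way.
    return a if a.value >= b.value else b


old = Column("picture", b"cat.png", 100)
new = Column("picture", b"dog.png", 200)
tie = Column("picture", b"emu.png", 200)

print(reconcile(old, new).value)  # the newer timestamp wins
print(reconcile(new, tie).value)  # same timestamp: the higher bytes win
```

Because the function gives the same answer regardless of argument order, replicas that see the two updates in different orders still converge on one value.
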
>>>>>>>>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is not
>>>>>>>>>>>>>>>>> good enough for some of your use cases and you need to keep two
>>>>>>>>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the updates
>>>>>>>>>>>>>>>>> don't end up in the same column. This is easily achieved by appending
>>>>>>>>>>>>>>>>> some unique identifier to the column name, for instance. And when
>>>>>>>>>>>>>>>>> reading, do a slice and reconcile whatever you get back with whatever
>>>>>>>>>>>>>>>>> logic makes sense. If you do that, congrats, you've roughly emulated
>>>>>>>>>>>>>>>>> what vector clocks would do. Btw, no locking or anything needed.
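
The unique-column-name approach described above might look like the sketch below. It assumes a row can be modelled as an in-memory dict of column name to value; `write_concurrent`, `read_slice`, and `reconcile` are hypothetical helpers, no real Cassandra client API is used, and the set-union merge is just one example of application-specific logic.

```python
import uuid


def write_concurrent(row: dict, base_name: str, value: str) -> str:
    """Store the update under a unique column name so it cannot overwrite another."""
    column_name = f"{base_name}:{uuid.uuid4().hex}"
    row[column_name] = value
    return column_name


def read_slice(row: dict, base_name: str) -> list:
    """Return every value whose column name starts with the base name (a 'slice')."""
    prefix = base_name + ":"
    return [row[name] for name in sorted(row) if name.startswith(prefix)]


def reconcile(values: list) -> set:
    """Application-specific merge; here, concurrent cart additions are unioned."""
    return set(values)


row = {}  # hypothetical in-memory stand-in for one row
write_concurrent(row, "cart", "book")  # client 1
write_concurrent(row, "cart", "lamp")  # client 2, concurrently
print(reconcile(read_slice(row, "cart")))
```

Both updates survive because they never share a column, at the cost of one extra column per update and a merge step on every read.
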
>>>>>>>>>>>>>>>>> In my experience, for most things the timestamp resolution is enough.
>>>>>>>>>>>>>>>>> If the same user updates their profile picture twice on your web site
>>>>>>>>>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the
>>>>>>>>>>>>>>>>> two pictures. In the rare case where you need something more specific,
>>>>>>>>>>>>>>>>> using the Cassandra data model usually solves the problem easily. The
>>>>>>>>>>>>>>>>> reason for not having vector clocks in Cassandra is that, so far, we
>>>>>>>>>>>>>>>>> haven't really found many examples where that is not the case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> Sylvain
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> Just to make a note: the "EVENTUAL" in eventual consistency could be a
>>>> time that is less than 1 ms.
>>>> 
>>>> I have a program that demonstrates what "eventual" means: if I write
>>>> data at the weakest level and read it back from a random other node
>>>> as soon as possible, 99% of the time I see the update. I can share the
>>>> code if you would like.
>>>> 
>>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>>> ...but there is no reference frame in which the two events can occur
>>>> at the same time...
>>>> 
>>>> As to the MongoDB references... Yes! Most of the NoSQL stores work
>>>> differently. They each approach CAP
>>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>>> different way.
>>>> 
>>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>>> have it all: pick two of the three from CAP.
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>

Re: New Chain for : Does Cassandra use vector clocks

Posted by A J <s5...@gmail.com>.

Though you are not really implying that, I am not selling anything. I
don't work for VoltDB. I had other issues with the software for my use
case when I was evaluating it (their durability claim is weak in my
opinion, and, though it does not matter, I'd rather they call
themselves NoSQL, since they just pay lip service to SQL).
I'd rather not drink any sort of Kool-Aid, but get all sides (whatever
their motives) and be the judge myself of what fits what I want to do.

The thread was started by someone who seems to be having difficulty
wrapping their head around the trade-offs of Cassandra. Maybe something
else is better for their use case.

Peace :)


On Fri, Feb 25, 2011 at 10:39 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> That article is heavily biased by "I am selling a competitor to Cassandra."
>
> First, read Coda's original piece if you haven't:
> http://codahale.com/you-cant-sacrifice-partition-tolerance/
>
> Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110
>
> On Thu, Feb 24, 2011 at 2:56 PM, A J <s5...@gmail.com> wrote:
>> While we are at it, there's more to consider than just CAP in distributed :)
>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>
>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>>> Yes, that is difficult to digest, and one has to be sure that the use
>>>> case can afford it.
>>>>
>>>> Some other NoSQL databases deal with it differently (though I don't
>>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>>> ask you to read from the node you wrote to first (the primary node) unless
>>>> you are OK with eventual consistency. If the write did not make it to a
>>>> majority of the other nodes, it will be rolled back from the original
>>>> primary when it comes up again as a secondary.
>>>> In some cases, you could still serve either the new value (that was
>>>> returned as failed) or the old one. But it is different from Cassandra
>>>> in the sense that Cassandra will never roll back.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>> prior state - as we are used to with databases. It means that the operation
>>>>> in error could have gone through partially
>>>>>
>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>>> -JA
>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>>
>>>>>> "but could be broken in case of a failed write"
>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>> Let's say the one node where the write succeeded goes down before
>>>>>> the write made it to the other N-1 nodes. Let's say it goes down for
>>>>>> good and is unrecoverable. The only option is to build a new node from
>>>>>> scratch from the other active nodes. The write is then lost, and you
>>>>>> will end up serving a stale copy of the data.
>>>>>>
>>>>>> It is better to talk in terms of use cases and whether Cassandra will
>>>>>> be a fit for them. Otherwise, unless you have W=R=N and fsync before
>>>>>> each write commit, there will be scope for inconsistency.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>>> wrote:
>>>>>> > I see the point - apologies for putting everyone through this!
>>>>>> > It was just militating against my mental model.
>>>>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>>>>> > conclude this thread (I hope):-
>>>>>> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>>>>>> > should be immediately followed by the same write going to a connection
>>>>>> > on to
>>>>>> > another node ( potentially using connection caches of client
>>>>>> > implementations
>>>>>> > ) or a Read at CL of All. Because a write could have partially gone
>>>>>> > through.
>>>>>> > 2. Timestamps are used in determining the latest version ( correcting
>>>>>> > the
>>>>>> > false impression I was propagating)
>>>>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>>>>> > in
>>>>>> > case of a failed write as it is unsure whether the new value got written
>>>>>> > on
>>>>>> >  any server or not. Is that a fair characterization ?
>>>>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>>> > cleanup and revert back, app code has to follow up if  immediate - and
>>>>>> > not
>>>>>> > eventual -  consistency is desired. I made that leap in almost all cases
>>>>>> > - I
>>>>>> > think - but the case of a failed write.
>>>>>> > My bad and I can live with this!
>>>>>> > Regards,
>>>>>> > -JA
>>>>>> >
>>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>> > <sy...@datastax.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Completely understand!
>>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>> >>> consistency or not. That is what the documentation says - right. IF
>>>>>> >>> for a CL
>>>>>> >>> of Q read - it depends on which node returns read first to determine
>>>>>> >>> the
>>>>>> >>> actual returned result or other more convoluted conditions , then a
>>>>>> >>> Quorum
>>>>>> >>> read/write is not consistent, by any definition.
>>>>>> >>
>>>>>> >> But that's the point. The definition of consistency we are talking about
>>>>>> >> has no meaning if you consider only a quorum read. The definition (which
>>>>>> >> is the de facto definition of consistency in 'eventually consistent')
>>>>>> >> makes sense if we talk about a write followed by a read, and it
>>>>>> >> considers a succeeding write followed by a succeeding read.
>>>>>> >> And that is the statement the wiki is making.
>>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>>> >> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>>>>>> >> replicas and then a (succeeding) read on R replicas, and if R+W>N, then
>>>>>> >> the read is guaranteed to see the preceding write. And this is what is
>>>>>> >> called consistency in the context of eventual consistency (which is not
>>>>>> >> the context of ACID).
>>>>>> >> If this is not the definition of consistency you had in mind then, by all
>>>>>> >> means, Cassandra probably doesn't guarantee that definition. But given
>>>>>> >> that the paragraph preceding what you pasted states clearly that we are
>>>>>> >> not talking about ACID consistency, but eventual consistency, I don't
>>>>>> >> think the wiki is making any unfair statement.
>>>>>> >> That being said, the wiki may not always be as clear as it could be. But
>>>>>> >> it's an editable wiki :)
>>>>>> >> --
>>>>>> >> Sylvain
>>>>>> >>
>>>>>> >>>
>>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>> >>> make
>>>>>> >>> this statement on the Wiki architecture section:-
>>>>>> >>> -------------------------------------------------------------
>>>>>> >>>
>>>>>> >>> More specifically: R=read replica count W=write replica
>>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>> >>>
>>>>>> >>> If W + R > N, you will have consistency
>>>>>> >>>
>>>>>> >>> W=1, R=N
>>>>>> >>> W=N, R=1
>>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>>> >>>
>>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>>> >>> + write
>>>>>> >>> replica count > replication factor).
>>>>>> >>>
>>>>>> >>> ----------------------------------------------------
>>>>>> >>>
>>>>>> >>> .
>>>>>> >>>
>>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>> >>> <sy...@datastax.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> If you are correct and you are probably closer to the code - then CL
>>>>>> >>>>> of
>>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>>> >>>>
>>>>>> >>>> If the operation succeeds, it does (for some definition of consistency,
>>>>>> >>>> namely that subsequent reads at Quorum are guaranteed to see the new
>>>>>> >>>> value of an update at Quorum). If it fails, then no, it does not
>>>>>> >>>> guarantee consistency.
>>>>>> >>>> It is important to note that the word consistency has multiple meanings.
>>>>>> >>>> In particular, when we are talking of consistency in Cassandra, we are
>>>>>> >>>> not talking of the same definition as the C in ACID
>>>>>> >>>>
>>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>> >>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>> >>>>>> <ch...@gmail.com>
>>>>>> >>>>>> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless is is
>>>>>> >>>>>>>> >> part of the application logic!!!
>>>>>> >>>>>>>
>>>>>> >>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>>> >>>>>>> >> update twice the same column (which
>>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>>> >>>>>>> >> which
>>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>>> >>>>>>> important
>>>>>> >>>>>>> here. And yes we are responding to the immediate questions without
>>>>>> >>>>>>> covering
>>>>>> >>>>>>> all questions in the thread.
>>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>>> >>>>>>> not
>>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>>> >>>>>>
>>>>>> >>>>>> Not quite true.
>>>>>> >>>>>>>
>>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> >>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>> >>>>>>> write is
>>>>>> >>>>>>> returned as failed - right ?
>>>>>> >>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>>> >>>>>>> the
>>>>>> >>>>>>> write failed for.
>>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>>> >>>>>>
>>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>>> >>>>>> make up the quorum, then TS2 will be returned, because Cassandra will
>>>>>> >>>>>> compare the timestamps and decide what to return based on them. If
>>>>>> >>>>>> N2/N3 respond, however, both timestamps will be TS1 and so, after
>>>>>> >>>>>> timestamp resolution, it will still be TS1 that is returned.
>>>>>> >>>>>> So yes, timestamps are used for conflict resolution.
>>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>>>> >>>>>> quorum write, and only when it succeeds are you guaranteed that a
>>>>>> >>>>>> quorum read will always return TS2.
>>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> >>>>>> the write did not make it in (there is no revert).
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> -JA
>>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>> >>>>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>> >>>>>>>> <ch...@gmail.com> wrote:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Sylvan,
>>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>> >>>>>>>>> part of the application logic!!!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>> >>>>>>>> update twice the same column (which
>>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>>> >>>>>>>> which
>>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>>> >>>>>>>> Cassandra
>>>>>> >>>>>>>> provides fairly strong durability guarantee, so for some
>>>>>> >>>>>>>> definition you
>>>>>> >>>>>>>> don't "lose updates".
>>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>>> >>>>>>>> ACID
>>>>>> >>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't
>>>>>> >>>>>>>> support. If
>>>>>> >>>>>>>> we're talking about the guarantees of transaction, then by all
>>>>>> >>>>>>>> means,
>>>>>> >>>>>>>> cassandra won't provide it. And yes you can use cages or the like
>>>>>> >>>>>>>> to get
>>>>>> >>>>>>>> transaction. But that was not the point of the thread, was it ?
>>>>>> >>>>>>>> The thread
>>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with
>>>>>> >>>>>>>> transaction (vector
>>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>>> >>>>>>>> why
>>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>>> >>>>>>>> Cassandra.
>>>>>> >>>>>>>> --
>>>>>> >>>>>>>> Sylvain
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> -JA
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>> >>>>>>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>> >>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>> >>>>>>>>>>> > evaluated at all update
>>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >>>>>>>>>>> >
>>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>>> >>>>>>>>>>> > used
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>> >>>>>>>>>>> Questions:-
>>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> No locking, no.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>>> >>>>>>>>>>> data on different
>>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> Not sure why you are asking about CL.ALL specifically. But at any CL,
>>>>>> >>>>>>>>>> updating the same piece of data means updating the same column value.
>>>>>> >>>>>>>>>> In that case, the resolution rules are the following:
>>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>>>> >>>>>>>>>>     the higher timestamp. That is, the more recent of two updates
>>>>>> >>>>>>>>>>     wins.
>>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>>> >>>>>>>>>>     (byte comparison) and keeps the highest value. This is just to
>>>>>> >>>>>>>>>>     break ties in a consistent manner.
>>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two places
>>>>>> >>>>>>>>>> at the same instant), then you'll end up with one of the updates.
>>>>>> >>>>>>>>>> This happens at the column level.
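
The two resolution rules quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as described in the thread, not Cassandra's actual code; the `Column` type and the `reconcile` function are hypothetical names, and the tie-break is assumed to be a plain lexicographic comparison of the raw value bytes.

```python
from typing import NamedTuple


class Column(NamedTuple):
    """Hypothetical stand-in for one version of a column."""
    name: bytes
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    if a.timestamp != b.timestamp:
        # Rule 1: the higher (more recent) timestamp wins.
        return a if a.timestamp > b.timestamp else b
    # Rule 2: equal timestamps, so break the tie by comparing the value bytes.
    # Every replica applies the same rule, so they all pick the same winner.
    return a if a.value >= b.value else b


older = Column(b"picture", b"cat.png", 100)
newer = Column(b"picture", b"dog.png", 200)
tie_a = Column(b"picture", b"aaa", 300)
tie_b = Column(b"picture", b"bbb", 300)

print(reconcile(older, newer).value)  # the version with timestamp 200
print(reconcile(tie_a, tie_b).value)  # the larger byte string
```

Because the rule is deterministic and symmetric, the order in which replicas see the two updates does not matter, which is why no locking is involved.
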
>>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>> >>>>>>>>>> not good enough for some of your use cases and you need to keep two
>>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the updates
>>>>>> >>>>>>>>>> don't end up in the same column. This is easily achieved by appending
>>>>>> >>>>>>>>>> some unique identifier to the column name, for instance. And when
>>>>>> >>>>>>>>>> reading, do a slice and reconcile whatever you get back with whatever
>>>>>> >>>>>>>>>> logic makes sense. If you do that, congrats, you've roughly emulated
>>>>>> >>>>>>>>>> what vector clocks would do. Btw, no locking or anything needed.
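
A minimal sketch of the "unique identifier in the column name" idea, using a plain Python dict as a stand-in for a row. Nothing here is a real Cassandra client API: `write_update`, `read_slice`, and `merge_carts` are hypothetical helpers, the `:` separator is an arbitrary choice, and the merge logic (a set union of comma-separated items) is just one example of application-level reconciliation.

```python
import uuid


def write_update(row: dict, base_name: str, value: str, timestamp: int) -> str:
    """Store the update under a column name made unique by a random suffix."""
    column_name = f"{base_name}:{uuid.uuid4().hex}"
    row[column_name] = (value, timestamp)
    return column_name


def read_slice(row: dict, base_name: str) -> list:
    """Return every version stored under the given base name (a 'slice')."""
    prefix = base_name + ":"
    return [version for name, version in sorted(row.items())
            if name.startswith(prefix)]


def merge_carts(versions: list) -> set:
    """Application-specific reconciliation: union of all items seen."""
    merged = set()
    for value, _timestamp in versions:
        merged.update(value.split(","))
    return merged


row = {}
# Two concurrent writers, same timestamp, different content: both survive.
write_update(row, "cart", "apple,pear", 10)
write_update(row, "cart", "pear,plum", 10)

versions = read_slice(row, "cart")
print(len(versions))          # both updates are still there
print(merge_carts(versions))  # reconciled at read time by the application
```

The cost is visible even in this toy: every concurrent update adds a column, and the reader has to do the merging.
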
>>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>> >>>>>>>>>> enough. If the same user updates their profile picture twice on your
>>>>>> >>>>>>>>>> web site at the same microsecond, it's usually fine to end up with one
>>>>>> >>>>>>>>>> of the two pictures. In the rare case where you need something more
>>>>>> >>>>>>>>>> specific, using the Cassandra data model usually solves the problem
>>>>>> >>>>>>>>>> easily. The reason for not having vector clocks in Cassandra is that
>>>>>> >>>>>>>>>> so far, we haven't really found many examples where this is not the
>>>>>> >>>>>>>>>> case.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> --
>>>>>> >>>>>>>>>> Sylvain
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Just to make a note: the "EVENTUAL" in eventual consistency could be a
>>> time that is less than 1 ms.
>>>
>>> I have a program that demonstrates what "eventual" means: if I write
>>> data at the weakest level and read it back from another random node
>>> as soon as possible, 99% of the time I see the update. I can share the
>>> code if you would like.
>>>
>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>> ...but there is no reference frame in which the two events can occur
>>> at the same time...
>>>
>>> As to the MongoDB references: yes, most of the NoSQL stores work
>>> differently. They each approach CAP
>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>> different way.
>>>
>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>> have it all: pick two of the three from CAP.
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Jonathan Ellis <jb...@gmail.com>.

That article is heavily biased by "I am selling a competitor to Cassandra."

First, read Coda's original piece if you haven't:
http://codahale.com/you-cant-sacrifice-partition-tolerance/

Then, Jeff Darcy's response: http://pl.atyp.us/wordpress/?p=3110

On Thu, Feb 24, 2011 at 2:56 PM, A J <s5...@gmail.com> wrote:
> While we are at it, there's more to consider than just CAP in distributed :)
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>
> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>> Yes, that is difficult to digest, and one has to be sure the use
>>> case can afford it.
>>>
>>> Some other NoSQL databases deal with it differently (though I don't
>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>> ask you to read from the node you wrote to first (the primary node) unless
>>> you are OK with eventual consistency. If the write did not make it to a
>>> majority of the other nodes, it will be rolled back from the original
>>> primary when it comes up again as a secondary.
>>> In some cases, you could still be served either the new value (that was
>>> returned as failed) or the old one. But it is different from Cassandra
>>> in the sense that Cassandra will never roll back.
>>>
>>>
>>>
>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>> prior state - as we are used to with databases. It means that the operation
>>>> in error could have gone through partially.
>>>>
>>>> Again, this is not entirely unfamiliar territory and can be dealt with.
>>>> -JA
>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>
>>>>> >>but could be broken in case of a failed write<<
>>>>> You can think of a scenario where R + W > N still leads to
>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>> Let's say the one node where a write succeeded goes down before the
>>>>> write made it to the other N-1 nodes. Let's say it goes down for good
>>>>> and is unrecoverable. The only option is to build a new node from
>>>>> scratch from the other active nodes. The write is then lost, and you
>>>>> will end up serving a stale copy of the data.
>>>>>
>>>>> It is better to talk in terms of use cases and whether Cassandra will be
>>>>> a fit for them. Otherwise, unless you have W=R=N and fsync before each
>>>>> write commit, there will be scope for inconsistency.
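
The W=1, R=N scenario above can be played through with a toy model. This is a sketch under simplifying assumptions, not a simulation of Cassandra: replicas are a dict mapping a node name to a `(timestamp, value)` pair, `read_all` is a hypothetical helper that returns the newest pair, and the "rebuild" is modelled as copying from one surviving replica.

```python
def read_all(replicas: dict) -> tuple:
    """R=N read: newest (timestamp, value) pair across all live replicas."""
    return max(replicas.values())


# N=3. A W=1 write with timestamp 2 succeeded on N1 only; N2 and N3 still
# hold the old version because replication had not happened yet.
replicas = {"N1": (2, "new"), "N2": (1, "old"), "N3": (1, "old")}
before_crash = read_all(replicas)

# N1 is lost for good before the write reached the other nodes. Its
# replacement is rebuilt from a surviving replica, which never saw the write.
del replicas["N1"]
replicas["N1-replacement"] = replicas["N2"]
after_rebuild = read_all(replicas)

print(before_crash)   # the acknowledged write is visible
print(after_rebuild)  # the acknowledged write is gone
```

The write was reported as successful and R + W > N held throughout, yet the value is unrecoverable once the only node holding it is gone.
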
>>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>> wrote:
>>>>> > I see the point - apologies for putting everyone through this!
>>>>> > It was just militating against my mental model.
>>>>> > In summary, here is my takeaway - simple stuff but - IMO - important to
>>>>> > conclude this thread (I hope):-
>>>>> > 1. I was splitting hairs over a failed (partial) Q write. Such an event
>>>>> > should be immediately followed by the same write going over a connection
>>>>> > to another node (potentially using the connection caches of client
>>>>> > implementations) or a read at CL of ALL, because a write could have
>>>>> > partially gone through.
>>>>> > 2. Timestamps are used in determining the latest version (correcting
>>>>> > the false impression I was propagating).
>>>>> > Finally, the "W + R > N for Q CL" statement holds, but could be broken
>>>>> > in case of a failed write, as it is unsure whether the new value got
>>>>> > written on any server or not. Is that a fair characterization?
>>>>> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
>>>>> > cleanup and rollback; app code has to follow up if immediate - and not
>>>>> > eventual - consistency is desired. I made that leap in almost all cases
>>>>> > - I think - except the case of a failed write.
>>>>> > My bad and I can live with this!
>>>>> > Regards,
>>>>> > -JA
>>>>> >
>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>> > <sy...@datastax.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Completely understand!
>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>> >>> consistency or not. That is what the documentation says - right? If,
>>>>> >>> for a read at CL of Q, the returned result depends on which node
>>>>> >>> answers the read first, or on other more convoluted conditions, then a
>>>>> >>> quorum read/write is not consistent, by any definition.
>>>>> >>
>>>>> >> But that's the point. The definition of consistency we are talking
>>>>> >> about has no meaning if you consider only a quorum read. The definition
>>>>> >> (which is the de facto definition of consistency in 'eventually
>>>>> >> consistent') makes sense if we talk about a write followed by a read.
>>>>> >> And it considers a successful write followed by a successful read.
>>>>> >> And that is the statement the wiki is making.
>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>> >> whatnot. Cassandra guarantees that if you do a (successful) write on W
>>>>> >> replicas and then a (successful) read on R replicas, and if R+W>N, then
>>>>> >> the read is guaranteed to see the preceding write. And this is what is
>>>>> >> called consistency in the context of eventual consistency (which is not
>>>>> >> the context of ACID).
>>>>> >> If this is not the definition of consistency you had in mind, then by
>>>>> >> all means, Cassandra probably doesn't guarantee your definition. But
>>>>> >> given that the paragraph preceding what you pasted states clearly that
>>>>> >> we are not talking about ACID consistency, but eventual consistency, I
>>>>> >> don't think the wiki is making any unfair statement.
>>>>> >> That being said, the wiki may not always be as clear as it could be.
>>>>> >> But it's an editable wiki :)
>>>>> >> --
>>>>> >> Sylvain
>>>>> >>
>>>>> >>>
>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>> >>> make
>>>>> >>> this statement on the Wiki architecture section:-
>>>>> >>> -------------------------------------------------------------
>>>>> >>>
>>>>> >>> More specifically: R=read replica count W=write replica
>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>> >>>
>>>>> >>> If W + R > N, you will have consistency
>>>>> >>>
>>>>> >>> W=1, R=N
>>>>> >>> W=N, R=1
>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>> >>>
>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>> >>> + write
>>>>> >>> replica count > replication factor).
>>>>> >>>
>>>>> >>> ----------------------------------------------------
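
The arithmetic behind the wiki excerpt above can be checked by brute force. The sketch below assumes that `Q = N / 2 + 1` uses integer division and that "consistency" here means only that every possible set of W written replicas shares at least one replica with every possible set of R read replicas; `quorum` and `always_overlap` are hypothetical helper names.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Quorum size for replication factor n (integer division assumed)."""
    return n // 2 + 1


def always_overlap(w: int, r: int, n: int) -> bool:
    """True if every w-replica write set intersects every r-replica read set."""
    nodes = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(nodes, w)
               for read_set in combinations(nodes, r))


n = 3
q = quorum(n)
print(q)                        # quorum size for N=3
print(always_overlap(q, q, n))  # W=Q, R=Q
print(always_overlap(1, n, n))  # W=1, R=N
print(always_overlap(n, 1, n))  # W=N, R=1
print(always_overlap(1, q, n))  # W + R = N: some read sets miss the write
```

When W + R > N the two sets cannot both fit among N replicas without sharing one, so at least one replica that answers the read also took the write.
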
>>>>> >>>
>>>>> >>> .
>>>>> >>>
>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>> >>> <sy...@datastax.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> If you are correct and you are probably closer to the code - then CL
>>>>> >>>>> of
>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>> >>>>
>>>>> >>>> If the operation succeeds, it does (for some definition of consistency,
>>>>> >>>> namely that subsequent reads at quorum are guaranteed to see the new
>>>>> >>>> value of an update at quorum). If it fails, then no, it does not
>>>>> >>>> guarantee consistency.
>>>>> >>>> It is important to note that the word consistency has multiple
>>>>> >>>> meanings. In particular, when we are talking of consistency in
>>>>> >>>> Cassandra, we are not talking of the same definition as the C in ACID
>>>>> >>>>
>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>> >>>>>
>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> >>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>> >>>>>> <ch...@gmail.com>
>>>>> >>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless is is
>>>>> >>>>>>>> >> part of the application logic!!!
>>>>> >>>>>>>
>>>>> >>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>> >>>>>>> >> update twice the same column (which
>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>> >> which
>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>> >>>>>>> important
>>>>> >>>>>>> here. And yes we are responding to the immediate questions without
>>>>> >>>>>>> covering
>>>>> >>>>>>> all questions in the thread.
>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>> >>>>>>> not
>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>> >>>>>>
>>>>> >>>>>> Not quite true.
>>>>> >>>>>>>
>>>>> >>>>>>> E.g. - quorum is 2 nodes, with RF of 3 over N1/N2/N3.
>>>>> >>>>>>> A quorum write comes in and adds/updates the timestamp (TS2) of a
>>>>> >>>>>>> particular data element. It succeeds on N1 but fails on N2/N3. So
>>>>> >>>>>>> the write is returned as failed - right?
>>>>> >>>>>>> Now a quorum read comes in for exactly the same piece of data that
>>>>> >>>>>>> the write failed for.
>>>>> >>>>>>> So N1 has TS2 but both N2/N3 have the old TS (say TS1).
>>>>> >>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>> >>>>>>
>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>> >>>>>> make up the quorum, then TS2 will be returned, because Cassandra will
>>>>> >>>>>> compare the timestamps and decide what to return based on this. If
>>>>> >>>>>> N2/N3 respond, however, both timestamps will be TS1 and so, after
>>>>> >>>>>> timestamp resolution, it will still be TS1 that is returned.
>>>>> >>>>>> So yes, timestamps are used for conflict resolution.
>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>>> >>>>>> quorum write, and only when it succeeds can you be guaranteed that a
>>>>> >>>>>> quorum read will always return TS2.
>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>> >>>>>> the write did not make it in (there is no revert).
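
The N1/N2/N3 example above can be enumerated directly. This is a toy model with assumed names (`quorum_read`, plain integers standing in for TS1 and TS2); it only applies the "highest timestamp among the responders wins" rule described in this thread and ignores everything else a real cluster does.

```python
from itertools import combinations

TS1, TS2 = 1, 2

# State after the failed quorum write: it landed on N1 only.
after_failed_write = {"N1": TS2, "N2": TS1, "N3": TS1}


def quorum_read(replicas: dict, responders: tuple) -> int:
    """Timestamp returned when exactly these replicas answer the read."""
    return max(replicas[node] for node in responders)


def all_quorum_reads(replicas: dict) -> dict:
    """Result of a 2-node quorum read for every possible pair of responders."""
    return {pair: quorum_read(replicas, pair)
            for pair in combinations(sorted(replicas), 2)}


failed = all_quorum_reads(after_failed_write)
for pair, ts in failed.items():
    print(pair, "->", "TS2" if ts == TS2 else "TS1")

# Retrying until the quorum write succeeds puts TS2 on at least two replicas,
# so every pair of responders now includes one that has it.
after_retry = {"N1": TS2, "N2": TS2, "N3": TS1}
retried = all_quorum_reads(after_retry)
```

In the failed state two of the three responder pairs return TS2 and one returns TS1, which is exactly the "it depends on which nodes respond" answer; after a successful retry all three pairs return TS2.
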
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>> >>>>>>> Regards,
>>>>> >>>>>>> -JA
>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>> >>>>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>> >>>>>>>> <ch...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sylvan,
>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>> >>>>>>>>> part of the application logic!!!
>>>>> >>>>>>>>
>>>>> >>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>> >>>>>>>> update twice the same column (which
>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>>> which
>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>> >>>>>>>> Cassandra
>>>>> >>>>>>>> provides fairly strong durability guarantee, so for some
>>>>> >>>>>>>> definition you
>>>>> >>>>>>>> don't "lose updates".
>>>>> >>>>>>>> That being said, I never pretended that Cassandra provided any
>>>>> >>>>>>>> ACID
>>>>> >>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't
>>>>> >>>>>>>> support. If
>>>>> >>>>>>>> we're talking about the guarantees of transaction, then by all
>>>>> >>>>>>>> means,
>>>>> >>>>>>>> cassandra won't provide it. And yes you can use cages or the like
>>>>> >>>>>>>> to get
>>>>> >>>>>>>> transaction. But that was not the point of the thread, was it ?
>>>>> >>>>>>>> The thread
>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with
>>>>> >>>>>>>> transaction (vector
>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>> >>>>>>>> why
>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>> >>>>>>>> Cassandra.
>>>>> >>>>>>>> --
>>>>> >>>>>>>> Sylvain
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> -JA
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>> >>>>>>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>> >>>>>>>>>> <ch...@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>> >>>>>>>>>>> > evaluated at all update
>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>> >>>>>>>>>>> >
>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>> >>>>>>>>>>> > used
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>> >>>>>>>>>>> Questions:-
>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> No locking, no.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>> >>>>>>>>>>> data on different
>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>> >>>>>>>>>> that case,
>>>>> >>>>>>>>>> the resolution rules are the following:
>>>>> >>>>>>>>>>   - If the updates have a different timestamp, keep the one
>>>>> >>>>>>>>>> with
>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>> >>>>>>>>>> win.
>>>>> >>>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>> >>>>>>>>>> (byte comparison) and keep the highest value. This is just to
>>>>> >>>>>>>>>> break ties in
>>>>> >>>>>>>>>> a consistent manner.
>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is from two
>>>>> >>>>>>>>>> place
>>>>> >>>>>>>>>> at the same instant), then you'll end with one of the update.
>>>>> >>>>>>>>>> This is the
>>>>> >>>>>>>>>> column level.
>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>> >>>>>>>>>> is
>>>>> >>>>>>>>>> not good enough for some of your use case and you need to keep
>>>>> >>>>>>>>>> two
>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>> >>>>>>>>>> update don't
>>>>> >>>>>>>>>> end up in the same column. This is easily achieved by appending
>>>>> >>>>>>>>>> some unique
>>>>> >>>>>>>>>> identifier to the column name for instance. And when reading,
>>>>> >>>>>>>>>> do a slice and
>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic make sense.
>>>>> >>>>>>>>>> If you do
>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks
>>>>> >>>>>>>>>> would do. Btw, no
>>>>> >>>>>>>>>> locking or anything needed.
>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>> >>>>>>>>>> enough. If the same user update twice it's profile picture on
>>>>> >>>>>>>>>> you web site
>>>>> >>>>>>>>>> at the same microsecond, it's usually fine to end up with one
>>>>> >>>>>>>>>> of the two
>>>>> >>>>>>>>>> pictures. In the rare case where you need something more
>>>>> >>>>>>>>>> specific, using the
>>>>> >>>>>>>>>> cassandra data model usually solves the problem easily. The
>>>>> >>>>>>>>>> reason for not
>>>>> >>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't
>>>>> >>>>>>>>>> really found
>>>>> >>>>>>>>>> much example where it is no the case.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Sylvain
>>>>> >>>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> >
>>>>
>>>>
>>>
>>
>>
>> Just to make a note the "EVENTUAL" in eventual consistency could be a
>> time that is less then 1ms.
>>
>> I have a program that demonstrates that "eventual" means if i write
>> data at the weakest level, and read it back from a random another node
>> as soon as possible. 99% I see the update. I can share the code if you
>> would like.
>>
>> Remember http://en.wikipedia.org/wiki/Spacetime
>> ...but there is no reference frame in which the two events can occur
>> at the same time...
>>
>> As to MongoDB references ....Yes! most of the noSQL work differently.
>> They each approach CAP
>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>> different way.
>>
>> Cassandra does not lock (it is no secret). But remember, you can not
>> have it all pick 2/3 from CAP.
>>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: New Chain for : Does Cassandra use vector clocks

Posted by A J <s5...@gmail.com>.

He has a product to sell, so you can expect some advertising. But in
general, Stonebraker's articles are very deep (another one that
challenges general conceptions is
http://voltdb.com/voltdb-webinar-sql-urban-myths ). He is the creator
of Postgres and considered a guru in databases by many.
And actually, if you cannot let go of ACID and are not satisfied with
traditional DBMS solutions, VoltDB is worth considering. Of course, it
solves a different problem (OLTP) than Cassandra does.


On Thu, Feb 24, 2011 at 5:20 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Thu, Feb 24, 2011 at 3:56 PM, A J <s5...@gmail.com> wrote:
>> While we are at it, there's more to consider than just CAP in distributed :)
>> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>>
>> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>>> Yes, that is difficult to digest, and one has to be sure the use
>>>> case can afford it.
>>>>
>>>> Some other NoSQL databases deal with it differently (though I don't
>>>> think any of them use atomic 2-phase commit). MongoDB, for example, will
>>>> ask you to read from the node you wrote to first (the primary node) unless
>>>> you are OK with eventual consistency. If the write did not make it to a
>>>> majority of the other nodes, it will be rolled back from the original
>>>> primary when it comes up again as a secondary.
>>>> In some cases, you could still be served either the new value (that was
>>>> returned as failed) or the old one. But it is different from Cassandra
>>>> in the sense that Cassandra will never roll back.
>>>>
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>>> prior state - as we are used to with databases. It means that the operation
>>>>> in error could have gone through partially
>>>>>
>>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>>> -JA
>>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>>
>>>>>> >>but could be broken in case of a failed write<<
>>>>>> You can think of a scenario where R + W > N still leads to
>>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>>> Let's say the one node where a write succeeded goes down before the
>>>>>> write made it to the other N-1 nodes. Let's say it goes down for good
>>>>>> and is unrecoverable. The only option is to build a new node from
>>>>>> scratch from the other active nodes. The write is then lost, and you
>>>>>> will end up serving a stale copy of the data.
>>>>>>
>>>>>> It is better to talk in terms of use cases and whether Cassandra will be
>>>>>> a fit for them. Otherwise, unless you have W=R=N and fsync before each
>>>>>> write commit, there will be scope for inconsistency.
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>>> wrote:
>>>>>> > I see the point - apologies for putting everyone through this!
>>>>>> > It was just militating against my mental model.
>>>>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>>>>> > conclude this thread (I hope):-
>>>>>> > 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
>>>>>> > should be immediately followed by the same write going to a connection
>>>>>> > on to
>>>>>> > another node ( potentially using connection caches of client
>>>>>> > implementations
>>>>>> > ) or a Read at CL of All. Because a write could have partially gone
>>>>>> > through.
>>>>>> > 2. Timestamps are used in determining the latest version ( correcting
>>>>>> > the
>>>>>> > false impression I was propagating)
>>>>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>>>>> > in
>>>>>> > case of a failed write as it is unsure whether the new value got written
>>>>>> > on
>>>>>> >  any server or not. Is that a fair characterization ?
>>>>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>>> > cleanup and revert back, app code has to follow up if  immediate - and
>>>>>> > not
>>>>>> > eventual -  consistency is desired. I made that leap in almost all cases
>>>>>> > - I
>>>>>> > think - but the case of a failed write.
>>>>>> > My bad and I can live with this!
>>>>>> > Regards,
>>>>>> > -JA
>>>>>> >
>>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>>> > <sy...@datastax.com>
>>>>>> > wrote:
>>>>>> >>
>>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Completely understand!
>>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>>> >>> consistency or not. That is what the documentation says - right. IF
>>>>>> >>> for a CL
>>>>>> >>> of Q read - it depends on which node returns read first to determine
>>>>>> >>> the
>>>>>> >>> actual returned result or other more convoluted conditions , then a
>>>>>> >>> Quorum
>>>>>> >>> read/write is not consistent, by any definition.
>>>>>> >>
>>>>>> >> But that's the point. The definition of consistency we are talking
>>>>>> >> about
>>>>>> >> has no meaning if you consider only a quorum read. The definition
>>>>>> >> (which is
>>>>>> >> the de facto definition of consistency in 'eventually consistent') make
>>>>>> >> sense if we talk about a write followed by a read. And it is
>>>>>> >> considering succeeding write followed by succeeding read.
>>>>>> >> And that is the statement the wiki is making.
>>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>>> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>>>>>> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
>>>>>> >> is
>>>>>> >> guaranteed that the read will see the preceding write. And this is what
>>>>>> >> is
>>>>>> >> called consistency in the context of eventual consistency (which is not
>>>>>> >> the
>>>>>> >> context of ACID).
>>>>>> >> If this is not the definition of consistency you had in mind then by
>>>>>> >> all
>>>>>> >> mean, Cassandra probably don't guarantee this definition. But given
>>>>>> >> that the
>>>>>> >> paragraph preceding what you pasted state clearly we are not talking
>>>>>> >> about
>>>>>> >> ACID consistency, but eventual consistency, I don't think the wiki is
>>>>>> >> making
>>>>>> >> any unfair statement.
>>>>>> >> That being said, the wiki may not be always as clear as it could. But
>>>>>> >> it's
>>>>>> >> an editable wiki :)
>>>>>> >> --
>>>>>> >> Sylvain
>>>>>> >>
>>>>>> >>>
>>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>>> >>> make
>>>>>> >>> this statement on the Wiki architecture section:-
>>>>>> >>> -------------------------------------------------------------
>>>>>> >>>
>>>>>> >>> More specifically: R=read replica count W=write replica
>>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>>> >>>
>>>>>> >>> If W + R > N, you will have consistency
>>>>>> >>>
>>>>>> >>> W=1, R=N
>>>>>> >>> W=N, R=1
>>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>>> >>>
>>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>>> >>> + write
>>>>>> >>> replica count > replication factor).
>>>>>> >>>
>>>>>> >>> ----------------------------------------------------
>>>>>> >>>
>>>>>> >>> .
>>>>>> >>>
>>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>>> >>> <sy...@datastax.com>
>>>>>> >>> wrote:
>>>>>> >>>>
>>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>>> >>>> wrote:
>>>>>> >>>>>
>>>>>> >>>>> If you are correct and you are probably closer to the code - then CL
>>>>>> >>>>> of
>>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>>> >>>>
>>>>>> >>>> If the operation succeeds, it does (for some definition of consistency,
>>>>>> >>>> namely that subsequent reads at Quorum are guaranteed to see the new
>>>>>> >>>> value
>>>>>> >>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>>>> >>>> consistency.
>>>>>> >>>> It is important to note that the word consistency has multiple
>>>>>> >>>> meanings.
>>>>>> >>>> In particular, when we are talking of consistency in Cassandra, we
>>>>>> >>>> are not
>>>>>> >>>> talking of the same definition as the C in ACID
>>>>>> >>>>
>>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>> >>>>>
>>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>>> >>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>
>>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>>> >>>>>> <ch...@gmail.com>
>>>>>> >>>>>> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>> >> part of the application logic!!!
>>>>>> >>>>>>>
>>>>>> >>>>>>> >>What is your definition of conflict resolution ? Because if you
>>>>>> >>>>>>> >> update the same column twice (which
>>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>>> >>>>>>> >> which
>>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>>> >>>>>>> important
>>>>>> >>>>>>> here. And yes we are responding to the immediate questions without
>>>>>> >>>>>>> covering
>>>>>> >>>>>>> all questions in the thread.
>>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>>> >>>>>>> not
>>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>>> >>>>>>
>>>>>> >>>>>> Not quite true.
>>>>>> >>>>>>>
>>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> >>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>>> >>>>>>> write is
>>>>>> >>>>>>> returned as failed - right ?
>>>>>> >>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>>> >>>>>>> the
>>>>>> >>>>>>> write failed for.
>>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>>> >>>>>>
>>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>>> >>>>>> make the
>>>>>> >>>>>> quorum, then TS2 will be returned, because Cassandra will compare
>>>>>> >>>>>> the
>>>>>> >>>>>> timestamps and decide what to return based on this. If N2/N3
>>>>>> >>>>>> respond
>>>>>> >>>>>> however, both timestamps will be TS1 and so, after timestamp
>>>>>> >>>>>> resolution, it
>>>>>> >>>>>> will still be TS1 that will be returned.
>>>>>> >>>>>> So yes timestamp is used for conflict resolution.
>>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>>> >>>>>> leave
>>>>>> >>>>>> your cluster in an inconsistent state. You'd have to retry the
>>>>>> >>>>>> quorum write and
>>>>>> >>>>>> only when it succeeds can you be guaranteed that a quorum read will
>>>>>> >>>>>> always
>>>>>> >>>>>> return TS2.
>>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>>> >>>>>> that
>>>>>> >>>>>> the write did not make it in (there is no revert).
>>>>>> >>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>>> >>>>>>> Regards,
>>>>>> >>>>>>> -JA
>>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>> >>>>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>> >>>>>>>> <ch...@gmail.com> wrote:
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> Sylvain,
>>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>>> >>>>>>>>> part of the application logic!!!
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>>> >>>>>>>> update the same column twice (which
>>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>>> >>>>>>>> which
>>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>>> >>>>>>>> Cassandra
>>>>>> >>>>>>>> provides fairly strong durability guarantees, so for some
>>>>>> >>>>>>>> definition you
>>>>>> >>>>>>>> don't "lose updates".
>>>>>> >>>>>>>> That being said, I never claimed that Cassandra provided any
>>>>>> >>>>>>>> ACID
>>>>>> >>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't
>>>>>> >>>>>>>> support. If
>>>>>> >>>>>>>> we're talking about the guarantees of transactions, then by all
>>>>>> >>>>>>>> means,
>>>>>> >>>>>>>> Cassandra won't provide them. And yes you can use cages or the like
>>>>>> >>>>>>>> to get
>>>>>> >>>>>>>> transactions. But that was not the point of the thread, was it ?
>>>>>> >>>>>>>> The thread
>>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with
>>>>>> >>>>>>>> transactions (vector
>>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>>> >>>>>>>> why
>>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>>> >>>>>>>> Cassandra.
>>>>>> >>>>>>>> --
>>>>>> >>>>>>>> Sylvain
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> -JA
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>> >>>>>>>>> <sy...@datastax.com> wrote:
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>> >>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>> >>>>>>>>>>> > evaluated at all update
>>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >>>>>>>>>>> >
>>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>>> >>>>>>>>>>> > used
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>> >>>>>>>>>>> Questions:-
>>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> No locking, no.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>>>
>>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>>> >>>>>>>>>>> data on different
>>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>>> >>>>>>>>>> that case,
>>>>>> >>>>>>>>>> the resolution rules are the following:
>>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one
>>>>>> >>>>>>>>>> with
>>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>>> >>>>>>>>>> wins.
>>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>>> >>>>>>>>>> (byte comparison) and keeps the highest value. This is just to
>>>>>> >>>>>>>>>> break ties in
>>>>>> >>>>>>>>>> a consistent manner.
>>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is from two
>>>>>> >>>>>>>>>> places
>>>>>> >>>>>>>>>> at the same instant), then you'll end up with one of the updates.
>>>>>> >>>>>>>>>> This is at the
>>>>>> >>>>>>>>>> column level.
>>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>>> >>>>>>>>>> is
>>>>>> >>>>>>>>>> not good enough for some of your use cases and you need to keep
>>>>>> >>>>>>>>>> two
>>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>>> >>>>>>>>>> updates don't
>>>>>> >>>>>>>>>> end up in the same column. This is easily achieved by appending
>>>>>> >>>>>>>>>> some unique
>>>>>> >>>>>>>>>> identifier to the column name for instance. And when reading,
>>>>>> >>>>>>>>>> do a slice and
>>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic makes sense.
>>>>>> >>>>>>>>>> If you do
>>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks
>>>>>> >>>>>>>>>> would do. Btw, no
>>>>>> >>>>>>>>>> locking or anything needed.
>>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>> >>>>>>>>>> enough. If the same user updates its profile picture twice on
>>>>>> >>>>>>>>>> your web site
>>>>>> >>>>>>>>>> at the same microsecond, it's usually fine to end up with one
>>>>>> >>>>>>>>>> of the two
>>>>>> >>>>>>>>>> pictures. In the rare case where you need something more
>>>>>> >>>>>>>>>> specific, using the
>>>>>> >>>>>>>>>> Cassandra data model usually solves the problem easily. The
>>>>>> >>>>>>>>>> reason for not
>>>>>> >>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't
>>>>>> >>>>>>>>>> really found
>>>>>> >>>>>>>>>> many examples where it is not the case.
>>>>>> >>>>>>>>>>
>>>>>> >>>>>>>>>> --
>>>>>> >>>>>>>>>> Sylvain
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>
>>>>>> >>>>>>>
>>>>>> >>>>>>
>>>>>> >>>>>
>>>>>> >>>>
>>>>>> >>>
>>>>>> >>
>>>>>> >
>>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>>
>>> Just to make a note the "EVENTUAL" in eventual consistency could be a
>>> time that is less than 1 ms.
>>>
>>> I have a program that demonstrates what "eventual" means: if I write
>>> data at the weakest level and read it back from another random node
>>> as soon as possible, 99% of the time I see the update. I can share the
>>> code if you would like.
>>>
>>> Remember http://en.wikipedia.org/wiki/Spacetime
>>> ...but there is no reference frame in which the two events can occur
>>> at the same time...
>>>
>>> As to MongoDB references ....Yes! most of the NoSQL stores work differently.
>>> They each approach CAP
>>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>>> different way.
>>>
>>> Cassandra does not lock (it is no secret). But remember, you cannot
>>> have it all: pick 2 of 3 from CAP.
>>>
>>
>
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
> I was reading that and many of the points were well taken...up until...
>
> Next generation DBMS technologies, such as VoltDB, have been shown to
> run around 50X the speed of conventional SQL engines.  Thus, if you
> need 200 nodes to support a specific SQL application, then VoltDB can
> probably do the same application on 4 nodes.  The probability of a
> failure on 200 nodes is wildly different than the probability of
> failure on four nodes.
>
> Come on, 200 nodes down to 4? I just cannot take it seriously anymore.
>
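
The timestamp resolution rules quoted in this thread (higher timestamp wins, byte comparison of values breaks ties) can be sketched in a few lines. This is only an illustration under stated assumptions, not Cassandra's actual code: `Column`, `reconcile` and `quorum_read` are hypothetical names, and the replies passed to `quorum_read` stand in for whichever replicas happened to answer first.

```python
from typing import NamedTuple, Sequence


class Column(NamedTuple):
    """Hypothetical stand-in for one version of a column held by a replica."""
    name: str
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    if a.timestamp != b.timestamp:
        # Different timestamps: the more recent update wins.
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: break the tie deterministically by byte comparison.
    return a if a.value >= b.value else b


def quorum_read(replies: Sequence[Column]) -> Column:
    """Reconcile the versions returned by the replicas that answered."""
    winner = replies[0]
    for reply in replies[1:]:
        winner = reconcile(winner, reply)
    return winner


if __name__ == "__main__":
    # The failed-write scenario from the thread: N1 holds TS2, N2/N3 hold TS1.
    n1 = Column("c", b"new", 2)
    n2 = Column("c", b"old", 1)
    n3 = Column("c", b"old", 1)
    print(quorum_read([n1, n2]).value)  # N1 answered: the new value is seen
    print(quorum_read([n2, n3]).value)  # N1 did not answer: the old value is seen
```

The two reads at the bottom mirror the N1/N2/N3 example: the result depends on whether the replica that accepted the failed write is among those that respond.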

Re: New Chain for : Does Cassandra use vector clocks

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Feb 24, 2011 at 3:56 PM, A J <s5...@gmail.com> wrote:
> While we are at it, there's more to consider than just CAP in distributed systems :)
> http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
>
> On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>>> yes, that is difficult to digest and one has to be sure if the use
>>> case can afford it.
>>>
>>> Some other NOSQL databases deal with it differently (though I don't
>>> think any of them use atomic 2-phase commit). MongoDB for example will
>>> ask you to read from the node you wrote to first (primary node) unless
>>> you are ok with eventual consistency. If the write did not make it to a
>>> majority of other nodes, it will be rolled back from the original
>>> primary when it comes up again as a secondary.
>>> In some cases, you could still serve either the new value (that was
>>> returned as failed) or the old one. But it is different from Cassandra
>>> in the sense that Cassandra will never roll back.
>>>
>>>
>>>
>>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>>> The leap of faith here is that an error does not mean a clean backing out to
>>>> prior state - as we are used to with databases. It means that the operation
>>>> in error could have gone through partially
>>>>
>>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>>> -JA
>>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>>
>>>>> >>but could be broken in case of a failed write<<
>>>>> You can think of a scenario where R + W > N still leads to
>>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>>> Let's say the one node where a write succeeded goes down
>>>>> before the write made it to the other N-1 nodes. Let's say it goes down
>>>>> for good and is unrecoverable. The only option is to build a new node from
>>>>> scratch from other active nodes. This will lead to a write that was
>>>>> lost and you will end up serving a stale copy of the data.
>>>>>
>>>>> It is better to talk in terms of use cases and whether Cassandra will be a
>>>>> fit for them. Otherwise unless you have W=R=N and fsync before each
>>>>> write commit, there will be scope for inconsistency.
>>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>>> wrote:
>>>>> > I see the point - apologies for putting everyone through this!
>>>>> > It was just militating against my mental model.
>>>>> > In summary, here is my take away - simple stuff but - IMO - important to
>>>>> > conclude this thread (I hope):-
>>>>> > 1. I was splitting hairs over a failed ( partial ) Q Write. Such an event
>>>>> > should be immediately followed by the same write going to a connection
>>>>> > on to
>>>>> > another node ( potentially using connection caches of client
>>>>> > implementations
>>>>> > ) or a Read at CL of All. Because a write could have partially gone
>>>>> > through.
>>>>> > 2. Timestamps are used in determining the latest version ( correcting
>>>>> > the
>>>>> > false impression I was propagating)
>>>>> > Finally, wrt "W + R > N for Q CL statement" holds, but could be broken
>>>>> > in
>>>>> > case of a failed write as it is unsure whether the new value got written
>>>>> > on
>>>>> >  any server or not. Is that a fair characterization ?
>>>>> > Bottom line - unlike traditional DBMS, errors do not ensure automatic
>>>>> > cleanup and revert back, app code has to follow up if  immediate - and
>>>>> > not
>>>>> > eventual -  consistency is desired. I made that leap in almost all cases
>>>>> > - I
>>>>> > think - but the case of a failed write.
>>>>> > My bad and I can live with this!
>>>>> > Regards,
>>>>> > -JA
>>>>> >
>>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>>> > <sy...@datastax.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> Completely understand!
>>>>> >>> All that I am quibbling over is whether a CL of quorum guarantees
>>>>> >>> consistency or not. That is what the documentation says - right. IF
>>>>> >>> for a CL
>>>>> >>> of Q read - it depends on which node returns read first to determine
>>>>> >>> the
>>>>> >>> actual returned result or other more convoluted conditions , then a
>>>>> >>> Quorum
>>>>> >>> read/write is not consistent, by any definition.
>>>>> >>
>>>>> >> But that's the point. The definition of consistency we are talking
>>>>> >> about
>>>>> >> has no meaning if you consider only a quorum read. The definition
>>>>> >> (which is
>>>>> >> the de facto definition of consistency in 'eventually consistent') make
>>>>> >> sense if we talk about a write followed by a read. And it is
>>>>> >> considering succeeding write followed by succeeding read.
>>>>> >> And that is the statement the wiki is making.
>>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>>> >> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>>>>> >> replica and then a (succeeding) read on R replica and if R+W>N, then it
>>>>> >> is
>>>>> >> guaranteed that the read will see the preceding write. And this is what
>>>>> >> is
>>>>> >> called consistency in the context of eventual consistency (which is not
>>>>> >> the
>>>>> >> context of ACID).
>>>>> >> If this is not the definition of consistency you had in mind then by
>>>>> >> all
>>>>> >> means, Cassandra probably doesn't guarantee that definition. But given
>>>>> >> that the
>>>>> >> paragraph preceding what you pasted states clearly we are not talking
>>>>> >> about
>>>>> >> ACID consistency, but eventual consistency, I don't think the wiki is
>>>>> >> making
>>>>> >> any unfair statement.
>>>>> >> That being said, the wiki may not always be as clear as it could be. But
>>>>> >> it's
>>>>> >> an editable wiki :)
>>>>> >> --
>>>>> >> Sylvain
>>>>> >>
>>>>> >>>
>>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>>> >>> make
>>>>> >>> this statement on the Wiki architecture section:-
>>>>> >>> -------------------------------------------------------------
>>>>> >>>
>>>>> >>> More specifically: R=read replica count W=write replica
>>>>> >>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>>> >>>
>>>>> >>> If W + R > N, you will have consistency
>>>>> >>>
>>>>> >>> W=1, R=N
>>>>> >>> W=N, R=1
>>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>>> >>>
>>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>>> >>> + write
>>>>> >>> replica count > replication factor).
>>>>> >>>
>>>>> >>> ----------------------------------------------------
>>>>> >>>
>>>>> >>> .
>>>>> >>>
>>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>>> >>> <sy...@datastax.com>
>>>>> >>> wrote:
>>>>> >>>>
>>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>>
>>>>> >>>>> If you are correct and you are probably closer to the code - then CL
>>>>> >>>>> of
>>>>> >>>>> Quorum does not guarantee a consistency.
>>>>> >>>>
>>>>> >>>> If the operation succeeds, it does (for some definition of consistency,
>>>>> >>>> namely that subsequent reads at Quorum are guaranteed to see the new
>>>>> >>>> value
>>>>> >>>> of an update at quorum). If it fails, then no, it does not guarantee
>>>>> >>>> consistency.
>>>>> >>>> It is important to note that the word consistency has multiple
>>>>> >>>> meanings.
>>>>> >>>> In particular, when we are talking of consistency in Cassandra, we
>>>>> >>>> are not
>>>>> >>>> talking of the same definition as the C in ACID
>>>>> >>>>
>>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>> >>>>>
>>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> >>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>
>>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>>> >>>>>> <ch...@gmail.com>
>>>>> >>>>>> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>>> >>>>>>>> >> part of the application logic!!!
>>>>> >>>>>>>
>>>>> >>>>>>> >>What is your definition of conflict resolution ? Because if you
>>>>> >>>>>>> >> update the same column twice (which
>>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>> >> which
>>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>>> >>>>>>> I understand what you are saying, and yes semantics is very
>>>>> >>>>>>> important
>>>>> >>>>>>> here. And yes we are responding to the immediate questions without
>>>>> >>>>>>> covering
>>>>> >>>>>>> all questions in the thread.
>>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>>> >>>>>>> not
>>>>> >>>>>>> used by Cassandra to figure out what data to return.
>>>>> >>>>>>
>>>>> >>>>>> Not quite true.
>>>>> >>>>>>>
>>>>> >>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>> >>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>> >>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the
>>>>> >>>>>>> write is
>>>>> >>>>>>> returned as failed - right ?
>>>>> >>>>>>> Now Quorum read comes in for exactly the same piece of data that
>>>>> >>>>>>> the
>>>>> >>>>>>> write failed for.
>>>>> >>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>> >>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>>> >>>>>>
>>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is part of the two that
>>>>> >>>>>> make the
>>>>> >>>>>> quorum, then TS2 will be returned, because Cassandra will compare
>>>>> >>>>>> the
>>>>> >>>>>> timestamps and decide what to return based on this. If N2/N3
>>>>> >>>>>> respond
>>>>> >>>>>> however, both timestamps will be TS1 and so, after timestamp
>>>>> >>>>>> resolution, it
>>>>> >>>>>> will still be TS1 that will be returned.
>>>>> >>>>>> So yes timestamp is used for conflict resolution.
>>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>>> >>>>>> leave
>>>>> >>>>>> your cluster in an inconsistent state. You'd have to retry the
>>>>> >>>>>> quorum write and
>>>>> >>>>>> only when it succeeds can you be guaranteed that a quorum read will
>>>>> >>>>>> always
>>>>> >>>>>> return TS2.
>>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>>> >>>>>> that
>>>>> >>>>>> the write did not make it in (there is no revert).
>>>>> >>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>>> >>>>>>> Regards,
>>>>> >>>>>>> -JA
>>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>> >>>>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>>>
>>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>> >>>>>>>> <ch...@gmail.com> wrote:
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> Sylvain,
>>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>>> >>>>>>>>> part of the application logic!!!
>>>>> >>>>>>>>
>>>>> >>>>>>>> What is your definition of conflict resolution ? Because if you
>>>>> >>>>>>>> update the same column twice (which
>>>>> >>>>>>>> I'll call a conflict), then the timestamps are used to decide
>>>>> >>>>>>>> which
>>>>> >>>>>>>> update wins (which I'll call a resolution).
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>>> >>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>> >>>>>>>>
>>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>>> >>>>>>>> Cassandra
>>>>> >>>>>>>> provides fairly strong durability guarantees, so for some
>>>>> >>>>>>>> definition you
>>>>> >>>>>>>> don't "lose updates".
>>>>> >>>>>>>> That being said, I never claimed that Cassandra provided any
>>>>> >>>>>>>> ACID
>>>>> >>>>>>>> guarantee. ACID relates to transactions, which Cassandra doesn't
>>>>> >>>>>>>> support. If
>>>>> >>>>>>>> we're talking about the guarantees of transactions, then by all
>>>>> >>>>>>>> means,
>>>>> >>>>>>>> Cassandra won't provide them. And yes you can use cages or the like
>>>>> >>>>>>>> to get
>>>>> >>>>>>>> transactions. But that was not the point of the thread, was it ?
>>>>> >>>>>>>> The thread
>>>>> >>>>>>>> is about vector clocks, and that has nothing to do with
>>>>> >>>>>>>> transactions (vector
>>>>> >>>>>>>> clocks certainly don't give you transactions).
>>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>>> >>>>>>>> why
>>>>> >>>>>>>> so far I don't think vector clocks would really provide much for
>>>>> >>>>>>>> Cassandra.
>>>>> >>>>>>>> --
>>>>> >>>>>>>> Sylvain
>>>>> >>>>>>>>
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> -JA
>>>>> >>>>>>>>>
>>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>> >>>>>>>>> <sy...@datastax.com> wrote:
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>> >>>>>>>>>> <ch...@gmail.com> wrote:
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>> >>>>>>>>>>> > evaluated at all update
>>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>> >>>>>>>>>>> >
>>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>>> >>>>>>>>>>> > used
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>> >>>>>>>>>>> Questions:-
>>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> No locking, no.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>>>
>>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>> >>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of
>>>>> >>>>>>>>>>> data on different
>>>>> >>>>>>>>>>> nodes can still mess each other up, right ?
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>> >>>>>>>>>> updating the same piece of data means the same column value. In
>>>>> >>>>>>>>>> that case,
>>>>> >>>>>>>>>> the resolution rules are the following:
>>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one
>>>>> >>>>>>>>>> with
>>>>> >>>>>>>>>> the higher timestamp. That is, the more recent of two updates
>>>>> >>>>>>>>>> wins.
>>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>>> >>>>>>>>>> (byte comparison) and keeps the highest value. This is just to
>>>>> >>>>>>>>>> break ties in
>>>>> >>>>>>>>>> a consistent manner.
>>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is from two
>>>>> >>>>>>>>>> places
>>>>> >>>>>>>>>> at the same instant), then you'll end up with one of the updates.
>>>>> >>>>>>>>>> This is at the
>>>>> >>>>>>>>>> column level.
>>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>>> >>>>>>>>>> is
>>>>> >>>>>>>>>> not good enough for some of your use cases and you need to keep
>>>>> >>>>>>>>>> two
>>>>> >>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the
>>>>> >>>>>>>>>> updates don't
>>>>> >>>>>>>>>> end up in the same column. This is easily achieved by appending
>>>>> >>>>>>>>>> some unique
>>>>> >>>>>>>>>> identifier to the column name for instance. And when reading,
>>>>> >>>>>>>>>> do a slice and
>>>>> >>>>>>>>>> reconcile whatever you get back with whatever logic makes sense.
>>>>> >>>>>>>>>> If you do
>>>>> >>>>>>>>>> that, congrats, you've roughly emulated what vector clocks
>>>>> >>>>>>>>>> would do. Btw, no
>>>>> >>>>>>>>>> locking or anything needed.
>>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>> >>>>>>>>>> enough. If the same user updates its profile picture twice on
>>>>> >>>>>>>>>> your web site
>>>>> >>>>>>>>>> at the same microsecond, it's usually fine to end up with one
>>>>> >>>>>>>>>> of the two
>>>>> >>>>>>>>>> pictures. In the rare case where you need something more
>>>>> >>>>>>>>>> specific, using the
>>>>> >>>>>>>>>> Cassandra data model usually solves the problem easily. The
>>>>> >>>>>>>>>> reason for not
>>>>> >>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't
>>>>> >>>>>>>>>> really found
>>>>> >>>>>>>>>> many examples where it is not the case.
>>>>> >>>>>>>>>>
>>>>> >>>>>>>>>> --
>>>>> >>>>>>>>>> Sylvain
>>>>> >>>>>>>>>
>>>>> >>>>>>>>
>>>>> >>>>>>>
>>>>> >>>>>>
>>>>> >>>>>
>>>>> >>>>
>>>>> >>>
>>>>> >>
>>>>> >
>>>>> >
>>>>
>>>>
>>>
>>
>>
>> Just to make a note the "EVENTUAL" in eventual consistency could be a
>> time that is less than 1 ms.
>>
>> I have a program that demonstrates what "eventual" means: if I write
>> data at the weakest level and read it back from another random node
>> as soon as possible, 99% of the time I see the update. I can share the
>> code if you would like.
>>
>> Remember http://en.wikipedia.org/wiki/Spacetime
>> ...but there is no reference frame in which the two events can occur
>> at the same time...
>>
>> As to MongoDB references ....Yes! most of the NoSQL stores work differently.
>> They each approach CAP
>> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
>> different way.
>>
>> Cassandra does not lock (it is no secret). But remember, you cannot
>> have it all: pick 2 of 3 from CAP.
>>
>

http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
I was reading that and many of the points were well taken...up until...

Next generation DBMS technologies, such as VoltDB, have been shown to
run around 50X the speed of conventional SQL engines.  Thus, if you
need 200 nodes to support a specific SQL application, then VoltDB can
probably do the same application on 4 nodes.  The probability of a
failure on 200 nodes is wildly different than the probability of
failure on four nodes.

Come on, 200 nodes down to 4? I just cannot take it seriously anymore.
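
The "W + R > N" rule debated in the quoted exchange comes down to a pigeonhole argument: any set of W replicas that acknowledged a write and any set of R replicas that answer a read must share at least one replica. A minimal sketch that checks this by brute force is below. It assumes successful operations only (the thread notes that a failed write gives no such guarantee), and `always_overlaps` and `quorum` are hypothetical helper names, not part of any Cassandra API.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Quorum size for replication factor n, as in the wiki: Q = N / 2 + 1."""
    return n // 2 + 1


def always_overlaps(n: int, w: int, r: int) -> bool:
    """True if every choice of w write replicas and r read replicas
    (out of n) shares at least one replica."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # The three combinations listed on the wiki, plus one that violates the rule.
    for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
        print(f"N={n} W={w} R={r} overlap guaranteed: {always_overlaps(n, w, r)}")
```

With N=3, the wiki's three combinations all guarantee an overlap, while W=1, R=1 does not: a read may land on a replica the write never reached.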

Re: New Chain for : Does Cassandra use vector clocks

Posted by A J <s5...@gmail.com>.

While we are at it, there's more to consider than just CAP in distributed systems :)
http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors

On Thu, Feb 24, 2011 at 3:31 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Thu, Feb 24, 2011 at 3:03 PM, A J <s5...@gmail.com> wrote:
>> yes, that is difficult to digest and one has to be sure if the use
>> case can afford it.
>>
>> Some other NOSQL databases deal with it differently (though I don't
>> think any of them use atomic 2-phase commit). MongoDB for example will
>> ask you to read from the node you wrote to first (primary node) unless
>> you are ok with eventual consistency. If the write did not make it to a
>> majority of other nodes, it will be rolled back from the original
>> primary when it comes up again as a secondary.
>> In some cases, you could still serve either the new value (that was
>> returned as failed) or the old one. But it is different from Cassandra
>> in the sense that Cassandra will never roll back.
>>
>>
>>
>> On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
>>> The leap of faith here is that an error does not mean a clean backing out to
>>> prior state - as we are used to with databases. It means that the operation
>>> in error could have gone through partially
>>>
>>> Again, this is not an absolutely unfamiliar territory and can be dealt with.
>>> -JA
>>> On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:
>>>>
>>>> >>but could be broken in case of a failed write<<
>>>> You can think of a scenario where R + W > N still leads to
>>>> inconsistency even for successful writes. Say you keep W=1 and R=N.
>>>> Let's say the one node where a write succeeded goes down
>>>> before the write made it to the other N-1 nodes. Let's say it goes down
>>>> for good and is unrecoverable. The only option is to build a new node from
>>>> scratch from other active nodes. This will lead to a write that was
>>>> lost and you will end up serving a stale copy of the data.
>>>>
>>>> It is better to talk in terms of use cases and whether Cassandra will be a
>>>> fit for them. Otherwise unless you have W=R=N and fsync before each
>>>> write commit, there will be scope for inconsistency.
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com>
>>>> wrote:
>>>> > I see the point - apologies for putting everyone through this!
>>>> > It was just militating against my mental model.
>>>> > In summary, here is my takeaway - simple stuff but, IMO, important to
>>>> > conclude this thread (I hope):
>>>> > 1. I was splitting hairs over a failed (partial) Quorum write. Such an
>>>> > event should be immediately followed by the same write going over a
>>>> > connection to another node (potentially using the connection caches of
>>>> > client implementations) or by a read at CL ALL, because the write could
>>>> > have partially gone through.
>>>> > 2. Timestamps are used in determining the latest version (correcting
>>>> > the false impression I was propagating).
>>>> > Finally, the "W + R > N" statement for Quorum CL holds, but could be
>>>> > broken in case of a failed write, as it is uncertain whether the new
>>>> > value got written on any server or not. Is that a fair characterization?
>>>> > Bottom line - unlike a traditional DBMS, errors do not ensure automatic
>>>> > cleanup and rollback; app code has to follow up if immediate - and not
>>>> > eventual - consistency is desired. I had made that leap in almost all
>>>> > cases, I think, except the case of a failed write.
>>>> > My bad and I can live with this!
>>>> > Regards,
>>>> > -JA
>>>> >
>>>> > On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne
>>>> > <sy...@datastax.com>
>>>> > wrote:
>>>> >>
>>>> >> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Completely understand!
>>>> >>> All that I am quibbling over is whether a CL of Quorum guarantees
>>>> >>> consistency or not. That is what the documentation says, right? If,
>>>> >>> for a read at CL Quorum, the actual returned result depends on which
>>>> >>> node responds first, or on other more convoluted conditions, then a
>>>> >>> Quorum read/write is not consistent, by any definition.
>>>> >>
>>>> >> But that's the point. The definition of consistency we are talking
>>>> >> about has no meaning if you consider only a quorum read. The definition
>>>> >> (which is the de facto definition of consistency in 'eventually
>>>> >> consistent') makes sense if we talk about a write followed by a read,
>>>> >> and specifically a successful write followed by a successful read.
>>>> >> And that is the statement the wiki is making.
>>>> >> Honestly, we could debate forever on the definition of consistency and
>>>> >> whatnot. Cassandra guarantees that if you do a (successful) write on W
>>>> >> replicas and then a (successful) read on R replicas, and if R+W>N, then
>>>> >> the read will see the preceding write. And this is what is called
>>>> >> consistency in the context of eventual consistency (which is not the
>>>> >> context of ACID).
>>>> >> If this is not the definition of consistency you had in mind then, by
>>>> >> all means, Cassandra probably doesn't guarantee that definition. But
>>>> >> given that the paragraph preceding what you pasted states clearly that
>>>> >> we are not talking about ACID consistency, but eventual consistency, I
>>>> >> don't think the wiki is making any unfair statement.
>>>> >> That being said, the wiki may not always be as clear as it could be.
>>>> >> But it's an editable wiki :)
>>>> >> --
>>>> >> Sylvain
>>>> >>
>>>> >>>
>>>> >>> I can still use Cassandra, and will use it, luv it!!! But let us not
>>>> >>> make
>>>> >>> this statement on the Wiki architecture section:-
>>>> >>> -------------------------------------------------------------
>>>> >>>
>>>> >>> More specifically:
>>>> >>>
>>>> >>> R = read replica count
>>>> >>> W = write replica count
>>>> >>> N = replication factor
>>>> >>> Q = QUORUM (Q = N / 2 + 1)
>>>> >>>
>>>> >>> If W + R > N, you will have consistency
>>>> >>>
>>>> >>> W=1, R=N
>>>> >>> W=N, R=1
>>>> >>> W=Q, R=Q where Q = N / 2 + 1
>>>> >>>
>>>> >>> Cassandra provides consistency when R + W > N (read replica count
>>>> >>> + write replica count > replication factor).
>>>> >>>
>>>> >>> ----------------------------------------------------
>>>> >>>
>>>> >>> .
>>>> >>>
>>>> >>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne
>>>> >>> <sy...@datastax.com>
>>>> >>> wrote:
>>>> >>>>
>>>> >>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>> >>>> wrote:
>>>> >>>>>
>>>> >>>>> If you are correct, and you are probably closer to the code, then a
>>>> >>>>> CL of Quorum does not guarantee consistency.
>>>> >>>>
>>>> >>>> If the operation succeeds, it does (for some definition of consistency,
>>>> >>>> namely that subsequent reads at Quorum are guaranteed to see the new
>>>> >>>> value of an update at Quorum). If it fails, then no, it does not
>>>> >>>> guarantee consistency.
>>>> >>>> It is important to note that the word consistency has multiple
>>>> >>>> meanings. In particular, when we are talking of consistency in
>>>> >>>> Cassandra, we are not talking of the same definition as the C in ACID
>>>> >>>>
>>>> >>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>> >>>>>
>>>> >>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>> >>>>> <sy...@datastax.com> wrote:
>>>> >>>>>>
>>>> >>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John
>>>> >>>>>> <ch...@gmail.com>
>>>> >>>>>> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> >>Time stamps are not used for conflict resolution - unless it is
>>>> >>>>>>>> >> part of the application logic!!!
>>>> >>>>>>>
>>>> >>>>>>> >>What is your definition of conflict resolution? Because if you
>>>> >>>>>>> >> update the same column twice (which
>>>> >>>>>>> >>I'll call a conflict), then the timestamps are used to decide
>>>> >>>>>>> >> which
>>>> >>>>>>> >> update wins (which I'll call a resolution).
>>>> >>>>>>> I understand what you are saying, and yes, semantics are very
>>>> >>>>>>> important here. And yes, we are responding to the immediate questions
>>>> >>>>>>> without covering all questions in the thread.
>>>> >>>>>>> The point being made here is that the timestamp of the column is
>>>> >>>>>>> not used by Cassandra to figure out what data to return.
>>>> >>>>>>
>>>> >>>>>> Not quite true.
>>>> >>>>>>>
>>>> >>>>>>> E.g. - Quorum is 2 nodes, with an RF of 3 over N1/N2/N3.
>>>> >>>>>>> A Quorum write comes in and adds/updates the timestamp (TS2) of a
>>>> >>>>>>> particular data element. It succeeds on N1 but fails on N2/N3. So
>>>> >>>>>>> the write is returned as failed - right?
>>>> >>>>>>> Now a Quorum read comes in for exactly the same piece of data that
>>>> >>>>>>> the write failed for.
>>>> >>>>>>> So N1 has TS2 but both N2/N3 have the old TS (say TS1).
>>>> >>>>>>> And the read succeeds - will it return TS1 or TS2?
>>>> >>>>>>> I submit it will return TS1 - the old TS.
>>>> >>>>>>
>>>> >>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>> >>>>>> RF=3, that can be any two of N1/N2/N3). If N1 is one of the two that
>>>> >>>>>> make up the quorum, then TS2 will be returned, because Cassandra will
>>>> >>>>>> compare the timestamps and decide what to return based on them. If
>>>> >>>>>> N2 and N3 respond, however, both timestamps will be TS1 and so, after
>>>> >>>>>> timestamp resolution, it will still be TS1 that is returned.
>>>> >>>>>> So yes, the timestamp is used for conflict resolution.
>>>> >>>>>> In your example, you could get TS1 back because a failed write can
>>>> >>>>>> leave your cluster in an inconsistent state. You'd have to retry the
>>>> >>>>>> quorum write, and only when it succeeds are you guaranteed that a
>>>> >>>>>> quorum read will always return TS2.
>>>> >>>>>> This is because when a write fails, Cassandra doesn't guarantee
>>>> >>>>>> that the write did not make it in (there is no revert).
>>>> >>>>>>
>>>> >>>>>>>
>>>> >>>>>>> Are we on the same page with this interpretation ?
>>>> >>>>>>> Regards,
>>>> >>>>>>> -JA
>>>> >>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>> >>>>>>> <sy...@datastax.com> wrote:
>>>> >>>>>>>>
>>>> >>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>> >>>>>>>> <ch...@gmail.com> wrote:
>>>> >>>>>>>>>
>>>> >>>>>>>>> Sylvain,
>>>> >>>>>>>>> Time stamps are not used for conflict resolution - unless it is
>>>> >>>>>>>>> part of the application logic!!!
>>>> >>>>>>>>
>>>> >>>>>>>> What is your definition of conflict resolution? Because if you
>>>> >>>>>>>> update the same column twice (which I'll call a conflict), then the
>>>> >>>>>>>> timestamps are used to decide which update wins (which I'll call a
>>>> >>>>>>>> resolution).
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> You can have "lost updates" w/Cassandra. You need to use 3rd-party
>>>> >>>>>>>>> products - Cages, for example - to get ACID-type consistency.
>>>> >>>>>>>>
>>>> >>>>>>>> Then again, you'll have to define what you are calling "lost
>>>> >>>>>>>> updates". Provided you use a reasonable consistency level,
>>>> >>>>>>>> Cassandra provides a fairly strong durability guarantee, so for some
>>>> >>>>>>>> definition you don't "lose updates".
>>>> >>>>>>>> That being said, I never claimed that Cassandra provided any
>>>> >>>>>>>> ACID guarantee. ACID relates to transactions, which Cassandra
>>>> >>>>>>>> doesn't support. If we're talking about the guarantees of
>>>> >>>>>>>> transactions, then by all means, Cassandra won't provide them. And
>>>> >>>>>>>> yes, you can use Cages or the like to get transactions. But that was
>>>> >>>>>>>> not the point of the thread, was it? The thread is about vector
>>>> >>>>>>>> clocks, and that has nothing to do with transactions (vector clocks
>>>> >>>>>>>> certainly don't give you transactions).
>>>> >>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to
>>>> >>>>>>>> why, so far, I don't think vector clocks would really provide much
>>>> >>>>>>>> for Cassandra.
>>>> >>>>>>>> --
>>>> >>>>>>>> Sylvain
>>>> >>>>>>>>
>>>> >>>>>>>>>
>>>> >>>>>>>>> -JA
>>>> >>>>>>>>>
>>>> >>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>> >>>>>>>>> <sy...@datastax.com> wrote:
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>> >>>>>>>>>> <ch...@gmail.com> wrote:
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>> >>>>>>>>>>> keeps bouncing back, thus this new one!
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> > From the other hand, the same article says:
>>>> >>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>> >>>>>>>>>>> > evaluated at all update
>>>> >>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>> >>>>>>>>>>> >
>>>> >>>>>>>>>>> > This means, that when doing such an update CL=ALL must be
>>>> >>>>>>>>>>> > used
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>> >>>>>>>>>>> Questions:-
>>>> >>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>> >>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> No locking, no.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>>>
>>>> >>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>> >>>>>>>>>>> conflicts? Concurrent updates on exactly the same piece of
>>>> >>>>>>>>>>> data on different nodes can still mess each other up, right?
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> Not sure why you are singling out CL.ALL specifically. But at any
>>>> >>>>>>>>>> CL, updating the same piece of data means updating the same column
>>>> >>>>>>>>>> value. In that case, the resolution rules are the following:
>>>> >>>>>>>>>>   - If the updates have different timestamps, keep the one with
>>>> >>>>>>>>>>     the higher timestamp. That is, the more recent of two updates
>>>> >>>>>>>>>>     wins.
>>>> >>>>>>>>>>   - If the timestamps are the same, then it compares the values
>>>> >>>>>>>>>>     (byte comparison) and keeps the highest value. This is just to
>>>> >>>>>>>>>>     break ties in a consistent manner.
>>>> >>>>>>>>>> So if you do two truly concurrent updates (that is, from two
>>>> >>>>>>>>>> places at the same instant), then you'll end up with one of the
>>>> >>>>>>>>>> updates. This is at the column level.
>>>> >>>>>>>>>> However, if that simple conflict detection/resolution mechanism
>>>> >>>>>>>>>> is not good enough for some of your use cases and you need to keep
>>>> >>>>>>>>>> two concurrent updates, it is easy enough. Just make sure that the
>>>> >>>>>>>>>> updates don't end up in the same column. This is easily achieved by
>>>> >>>>>>>>>> appending some unique identifier to the column name, for instance.
>>>> >>>>>>>>>> And when reading, do a slice and reconcile whatever you get back
>>>> >>>>>>>>>> with whatever logic makes sense. If you do that, congrats, you've
>>>> >>>>>>>>>> roughly emulated what vector clocks would do. Btw, no locking or
>>>> >>>>>>>>>> anything needed.
>>>> >>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>> >>>>>>>>>> enough. If the same user updates their profile picture twice on
>>>> >>>>>>>>>> your web site at the same microsecond, it's usually fine to end up
>>>> >>>>>>>>>> with one of the two pictures. In the rare case where you need
>>>> >>>>>>>>>> something more specific, using the Cassandra data model usually
>>>> >>>>>>>>>> solves the problem easily. The reason for not having vector clocks
>>>> >>>>>>>>>> in Cassandra is that, so far, we haven't really found many examples
>>>> >>>>>>>>>> where this is not the case.
>>>> >>>>>>>>>>
>>>> >>>>>>>>>> --
>>>> >>>>>>>>>> Sylvain
>>>> >>>>>>>>>
>>>> >>>>>>>>
>>>> >>>>>>>
>>>> >>>>>>
>>>> >>>>>
>>>> >>>>
>>>> >>>
>>>> >>
>>>> >
>>>> >
>>>
>>>
>>
>
>
> Just to make a note: the "EVENTUAL" in eventual consistency could be a
> time that is less than 1 ms.
>
> I have a program that demonstrates what "eventual" means: if I write
> data at the weakest level and read it back from another random node
> as soon as possible, 99% of the time I see the update. I can share the
> code if you would like.
>
> Remember http://en.wikipedia.org/wiki/Spacetime
> ...but there is no reference frame in which the two events can occur
> at the same time...
>
> As to the MongoDB references... Yes! Most of the NoSQL stores work
> differently. They each approach CAP
> http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
> different way.
>
> Cassandra does not lock (it is no secret). But remember, you cannot
> have it all: pick 2 of the 3 from CAP.
>
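
The resolution rules and the N1/N2/N3 example in the quoted thread above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Cassandra's code: `Column`, `reconcile`, and `quorum_read` are hypothetical names, replicas are a plain dictionary keyed by node name, and timestamps are small integers standing in for TS1 and TS2.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Column:
    value: bytes
    timestamp: int  # assumed to be supplied by the client


def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    if a.timestamp != b.timestamp:
        # Rule 1: the higher (more recent) timestamp wins.
        return a if a.timestamp > b.timestamp else b
    # Rule 2: equal timestamps, so break the tie on the raw bytes.
    return a if a.value >= b.value else b


def quorum_read(replicas: dict, responders: tuple) -> Column:
    """Reconcile the versions returned by the nodes that answered first."""
    versions = [replicas[node] for node in responders]
    winner = versions[0]
    for version in versions[1:]:
        winner = reconcile(winner, version)
    return winner


TS1, TS2 = 1, 2
old, new = Column(b"old", TS1), Column(b"new", TS2)

# A quorum write of `new` reached N1 only, so the client saw a failure,
# but nothing was reverted on N1.
replicas = {"N1": new, "N2": old, "N3": old}

# Try every possible pair of responders for a quorum read (RF=3, quorum=2).
results = {pair: quorum_read(replicas, pair).value
           for pair in combinations(sorted(replicas), 2)}
for pair, value in results.items():
    print(pair, value)
```

Any pair that includes N1 yields the new value, while N2 plus N3 yields the old one, which is the "it depends on which nodes respond" behaviour described in the thread.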

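The workaround from the quoted message (append a unique identifier to the column name, then slice and reconcile on read) can be sketched the same way. Everything here is an assumption made for illustration: a row is modelled as a plain dictionary of column name to value, the `base:writer_id` naming scheme and the helper names are hypothetical, and the merge logic (a set union) is just one example of "whatever logic makes sense".

```python
import uuid


def write_sibling(row: dict, base_name: str, value: set, writer_id=None) -> str:
    """Store the update in its own column so concurrent writers never collide."""
    writer_id = writer_id or uuid.uuid4().hex
    column_name = f"{base_name}:{writer_id}"
    row[column_name] = value
    return column_name


def slice_siblings(row: dict, base_name: str) -> list:
    """Emulate a column slice: every column whose name starts with the prefix."""
    prefix = base_name + ":"
    return [row[name] for name in sorted(row) if name.startswith(prefix)]


def read_merged(row: dict, base_name: str) -> set:
    """Application-level reconciliation; here, the union of all siblings."""
    merged = set()
    for value in slice_siblings(row, base_name):
        merged |= value
    return merged


row = {}
# Two clients update the "same" logical value concurrently.
write_sibling(row, "cart", {"book"}, writer_id="client-a")
write_sibling(row, "cart", {"pen"}, writer_id="client-b")

print(sorted(read_merged(row, "cart")))
```

Both updates survive because they live in different columns, at the cost of extra columns per logical value, which is the overhead raised elsewhere in the thread.
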
Re: New Chain for : Does Cassandra use vector clocks

Posted by Edward Capriolo <ed...@gmail.com>.

Just to make a note: the "EVENTUAL" in eventual consistency could be a
time that is less than 1 ms.

I have a program that demonstrates what "eventual" means: if I write
data at the weakest level and read it back from another random node
as soon as possible, 99% of the time I see the update. I can share the
code if you would like.

Remember http://en.wikipedia.org/wiki/Spacetime
...but there is no reference frame in which the two events can occur
at the same time...

As to the MongoDB references... Yes! Most of the NoSQL stores work
differently. They each approach CAP
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem in a
different way.

Cassandra does not lock (it is no secret). But remember, you cannot
have it all: pick 2 of the 3 from CAP.
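
The wiki rule debated earlier in this thread (R + W > N) comes down to a counting argument: if the write set and the read set together are larger than the replica count, they must share at least one node. A minimal brute-force check, assuming successful operations only and using hypothetical helper names (`quorum`, `always_overlaps`):

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def always_overlaps(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares a node with
    every set of r read replicas, out of n replicas in total."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))


N = 3
Q = quorum(N)
for w, r in [(1, N), (N, 1), (Q, Q), (1, 1), (1, Q)]:
    print(f"W={w} R={r} R+W>N={w + r > N} overlap={always_overlaps(N, w, r)}")
```

The check only covers operations that succeeded on W and R replicas; it says nothing about a write that failed after reaching fewer than W nodes, which is the case the thread spends most of its time on.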

Re: New Chain for : Does Cassandra use vector clocks

Posted by A J <s5...@gmail.com>.

Yes, that is difficult to digest, and one has to be sure that the use
case can afford it.

Some other NoSQL databases deal with it differently (though I don't
think any of them use atomic 2-phase commit). MongoDB, for example, will
ask you to read from the node you wrote to first (the primary node) unless
you are OK with eventual consistency. If the write did not make it to a
majority of the other nodes, it will be rolled back from the original
primary when it comes up again as a secondary.
In some cases, you could still serve either the new value (that was
returned as failed) or the old one. But it is different from Cassandra
in the sense that Cassandra will never roll back.



On Thu, Feb 24, 2011 at 2:47 PM, Anthony John <ch...@gmail.com> wrote:
> The leap of faith here is that an error does not mean a clean backing out to
> prior state - as we are used to with databases. It means that the operation
> in error could have gone through partially
>
> Again, this is not an absolutely unfamiliar territory and can be dealt with.
> -JA

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

The leap of faith here is that an error does not mean a clean rollback to
the prior state, as we are used to with databases. It means that the failed
operation could have gone through partially.

Again, this is not entirely unfamiliar territory and can be dealt with.
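
A small sketch of the N1/N2/N3 scenario discussed in this thread may help.
It is an in-memory model only, assuming the resolution rule described earlier
(higher timestamp wins, ties broken by comparing values); the function names
reconcile, quorum_write and quorum_read are invented for illustration and are
not Cassandra APIs.

```python
from itertools import combinations


def reconcile(a, b):
    """Pick the winner between two (timestamp, value) versions.

    Higher timestamp wins; on a tie the larger value (byte comparison) wins.
    """
    return max(a, b)  # tuples compare by timestamp first, then by value


def quorum_write(replicas, reachable, key, version, quorum):
    """Apply the write on every reachable replica and report success only if
    at least `quorum` replicas took it. Nothing is undone on failure."""
    acks = 0
    for name in reachable:
        current = replicas[name].get(key, version)
        replicas[name][key] = reconcile(current, version)
        acks += 1
    return acks >= quorum


def quorum_read(replicas, responders, key):
    """Reconcile the versions returned by the replicas that answered."""
    versions = [replicas[name][key] for name in responders]
    result = versions[0]
    for v in versions[1:]:
        result = reconcile(result, v)
    return result


TS1, TS2 = 1, 2
names = ("N1", "N2", "N3")
replicas = {n: {"col": (TS1, b"old")} for n in names}

# The write reaches only N1, so it is reported as failed, yet N1 keeps it.
ok = quorum_write(replicas, ["N1"], "col", (TS2, b"new"), quorum=2)

# What a quorum read returns now depends on which two replicas answer.
outcomes = {pair: quorum_read(replicas, pair, "col")
            for pair in combinations(names, 2)}

# Retrying until the write succeeds at quorum removes the ambiguity.
ok_retry = quorum_write(replicas, ["N1", "N2"], "col", (TS2, b"new"), quorum=2)
after_retry = {pair: quorum_read(replicas, pair, "col")
               for pair in combinations(names, 2)}
```

Before the retry, the pair (N2, N3) still returns the TS1 value while any
pair containing N1 returns the TS2 value. After the retry succeeds, every
pair of replicas overlaps the written quorum, so all reads return TS2.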

-JA

On Thu, Feb 24, 2011 at 1:16 PM, A J <s5...@gmail.com> wrote:

> >>but could be broken in case of a failed write<<
> You can think of a scenario where R + W >N still leads to
> inconsistency even for successful writes. Say you keep W=1 and R=N .
> Lets say the one node where a write happened with success goes down
> before it made to the other N-1 nodes. Lets say it goes down for good
> and is unrecoverable. The only option is to build a new node from
> scratch from other active nodes. This will lead to a write that was
> lost and you will end up serving stale copy of it.
>
> It is better to talk in terms of use cases and if cassandra will be a
> fit for it. Otherwise unless you have W=R=N and fsync before each
> write commit, there will be scope for inconsistency.

Re: New Chain for : Does Cassandra use vector clocks

Posted by A J <s5...@gmail.com>.

>>but could be broken in case of a failed write<<
You can think of a scenario where R + W > N still leads to
inconsistency even for successful writes. Say you keep W=1 and R=N.
Let's say the one node where the write succeeded goes down before the
write made it to the other N-1 nodes. Let's say it goes down for good
and is unrecoverable. The only option is to build a new node from
scratch from the other active nodes. The write is then lost, and you
will end up serving a stale copy of the data.

It is better to talk in terms of use cases and whether Cassandra will
be a fit for them. Otherwise, unless you have W=R=N and fsync before
each write commit, there will be scope for inconsistency.
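
The W=1, R=N scenario above can be sketched as follows. This is a toy
in-memory model with made-up names (read_all, "N1-replacement"), assuming a
replication factor of 3 and that the replacement node is rebuilt by copying a
surviving replica; it is not how any real bootstrap is implemented.

```python
def read_all(replicas, key):
    """R=N read: ask every live replica, keep the highest (timestamp, value)."""
    return max(node[key] for node in replicas.values())


N = 3
replicas = {f"N{i}": {"col": (1, "old")} for i in range(1, N + 1)}

# W=1: the write is acknowledged as soon as N1 has it (W + R = 1 + 3 > 3).
replicas["N1"]["col"] = (2, "new")
before_failure = read_all(replicas, "col")

# N1 dies for good before replicating; a replacement is rebuilt from a survivor.
del replicas["N1"]
survivor = next(iter(replicas.values()))
replicas["N1-replacement"] = dict(survivor)
after_rebuild = read_all(replicas, "col")

print(before_failure, after_rebuild)
```

While N1 is alive the R=N read sees the new value; once N1 is replaced from
the survivors, the same read returns the old value even though the write was
acknowledged.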


On Thu, Feb 24, 2011 at 1:25 PM, Anthony John <ch...@gmail.com> wrote:
> I see the point - apologies for putting everyone through this!
> It was just militating against my mental model.
> In summary, here is my takeaway - simple stuff but - IMO - important to
> conclude this thread (I hope):
> 1. I was splitting hairs over a failed (partial) Q write. Such an event
> should be immediately followed by the same write going over a connection to
> another node (potentially using the connection caches of client
> implementations) or by a read at CL of ALL, because the write could have
> partially gone through.
> 2. Timestamps are used in determining the latest version (correcting the
> false impression I was propagating).
> Finally, the "W + R > N for Q CL" statement holds, but could be broken in
> case of a failed write, as it is uncertain whether the new value got written
> on any server or not. Is that a fair characterization?
> Bottom line - unlike a traditional DBMS, errors do not ensure automatic
> cleanup and rollback; app code has to follow up if immediate - and not
> eventual - consistency is desired. I made that leap in almost all cases - I
> think - except the case of a failed write.
> My bad and I can live with this!
> Regards,
> -JA
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sy...@datastax.com>
> wrote:
>>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>
>> wrote:
>>>
>>> Completely understand!
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says - right. IF for a CL
>>> of Q read - it depends on which node returns read first to determine the
>>> actual returned result or other more convoluted conditions , then a Quorum
>>> read/write is not consistent, by any definition.
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') makes
>> sense if we talk about a write followed by a read. And it considers a
>> succeeding write followed by a succeeding read.
>> And that is the statement the wiki is making.
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guarantees that if you do a (succeeding) write on W
>> replicas and then a (succeeding) read on R replicas, and if R+W>N, then it
>> is guaranteed that the read will see the preceding write. And this is what
>> is called consistency in the context of eventual consistency (which is not
>> the context of ACID).
>> If this is not the definition of consistency you had in mind then by all
>> means, Cassandra probably doesn't guarantee that definition. But given that
>> the paragraph preceding what you pasted states clearly we are not talking
>> about ACID consistency, but eventual consistency, I don't think the wiki is
>> making any unfair statement.
>> That being said, the wiki may not always be as clear as it could be. But
>> it's an editable wiki :)
>> --
>> Sylvain
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>> -------------------------------------------------------------
>>>
>>> More specifically: R=read replica count W=write replica
>>> count N=replication factor Q=QUORUM (Q = N / 2 + 1)
>>>
>>> If W + R > N, you will have consistency
>>>
>>> W=1, R=N
>>> W=N, R=1
>>> W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> ----------------------------------------------------
>>>
>>> .
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sy...@datastax.com>
>>> wrote:
>>>>
>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>
>>>> wrote:
>>>>>
>>>>> If you are correct and you are probably closer to the code - then CL of
>>>>> Quorum does not guarantee a consistency.
>>>>
>>>> If the operation succeed, it does (for some definition of consistency
>>>> which is, following reads at Quorum will be guaranteed to see the new value
>>>> of a update at quorum). If it fails, then no, it does not guarantee
>>>> consistency.
>>>> It is important to note that the word consistency has multiple meaning.
>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>> talking of the same definition as the C in ACID
>>>> (see: http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>>
>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne
>>>>> <sy...@datastax.com> wrote:
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>> >>Time stamps are not used for conflict resolution - unless is is
>>>>>>>> >> part of the application logic!!!
>>>>>>>
>>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>>>> >> update twice the same column (which
>>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>>> >> update wins (which I'll call a resolution).
>>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>>> here. And yes we are responding to the immediate questions without covering
>>>>>>> all questions in the thread.
>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>
>>>>>> Not quite true.
>>>>>>>
>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>>>> returned as failed - right ?
>>>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>>>> write failed for.
>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>
>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> RF=3, that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>>>> timestamp and decide what to return based on this. If N2/N3 responds
>>>>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>>>>> will stil be TS1 that will be returned.
>>>>>> So yes timestamp is used for conflict resolution.
>>>>>> In your example, you could get TS1 back because a failed write can let
>>>>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>>>>> only when it succeeds can you be guaranteed that quorum read will always
>>>>>> return TS2.
>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> the write did not made it in (there is no revert).
>>>>>>
>>>>>>>
>>>>>>> Are we on the same page with this interpretation ?
>>>>>>> Regards,
>>>>>>> -JA
>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne
>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John
>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Sylvan,
>>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>>>>> part of the application logic!!!
>>>>>>>>
>>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>>>> update twice the same column (which
>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>
>>>>>>>>>
>>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>>
>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>> provides fairly strong durability guarantee, so for some definition you
>>>>>>>> don't "lose updates".
>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>>>>> clocks certainly don't give you transactions).
>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>>>
>>>>>>>>> -JA
>>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne
>>>>>>>>> <sy...@datastax.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John
>>>>>>>>>> <ch...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>>
>>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>>>>>>> > evaluated at all update
>>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>>>> >
>>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>>
>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>> Questions:-
>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>
>>>>>>>>>> No locking, no.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>>
>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>>>>> the resolution rules are the following:
>>>>>>>>>>   - If the updates have a different timestamp, keep the one with
>>>>>>>>>> the higher timestamp. That is, the more recent of two updates win.
>>>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>>>>>> a consistent manner.
>>>>>>>>>> So if you do two truly concurrent updates (that is from two place
>>>>>>>>>> at the same instant), then you'll end with one of the update. This is the
>>>>>>>>>> column level.
>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>>>> locking or anything needed.
>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>> enough. If the same user update twice it's profile picture on you web site
>>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>>>>>> much example where it is no the case.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sylvain
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Ritesh Tijoriwala <ti...@gmail.com>.

Thanks, all, for the detail and clarification. I just wanted to understand
correctly what behavior to expect from Cassandra under various failure
conditions, so that the application can be designed accordingly and provide
proper locking/synchronization if required.

Thanks,
Ritesh

On Thu, Feb 24, 2011 at 10:25 AM, Anthony John <ch...@gmail.com> wrote:

> I see the point - apologies for putting everyone through this!
>
> It was just militating against my mental model.
>
> In summary, here is my take away - simple stuff but - IMO - important to
> conclude this thread (I hope):-
> 1. I was splitting hair over a failed ( partial ) Q Write. Such an event
> should be immediately followed by the same write going to a connection on to
> another node ( potentially using connection caches of client implementations
> ) or a Read at CL of All. Because a write could have partially gone through.
> 2. Timestamps are used in determining the latest version ( correcting the
> false impression I was propagating)
>
> Finally, wrt "W + R > N for Q CL statement" holds, but could be broken in
> case of a failed write as it is unsure whether the new value got written on
>  any server or not. Is that a fair characterization ?
>
> Bottom line - unlike traditional DBMS, errors do not ensure automatic
> cleanup and revert back, app code has to follow up if  immediate - and not
> eventual -  consistency is desired. I made that leap in almost all cases - I
> think - but the case of a failed write.
>
> My bad and I can live with this!
>
> Regards,
>
> -JA
>
>
> On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>wrote:
>>
>>> Completely understand!
>>>
>>> All that I am quibbling over is whether a CL of quorum guarantees
>>> consistency or not. That is what the documentation says - right. IF for a CL
>>> of Q read - it depends on which node returns read first to determine the
>>> actual returned result or other more convoluted conditions , then a Quorum
>>> read/write is not consistent, by any definition.
>>>
>>
>> But that's the point. The definition of consistency we are talking about
>> has no meaning if you consider only a quorum read. The definition (which is
>> the de facto definition of consistency in 'eventually consistent') make
>> sense if we talk about a write followed by a read. And it is
>> considering succeeding write followed by succeeding read.
>> And that is the statement the wiki is making.
>>
>> Honestly, we could debate forever on the definition of consistency and
>> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
>> replica and then a (succeeding) read on R replica and if R+W>N, then it is
>> guaranteed that the read will see the preceding write. And this is what is
>> called consistency in the context of eventual consistency (which is not the
>> context of ACID).
>>
>> If this is not the definition of consistency you had in mind then by all
>> mean, Cassandra probably don't guarantee this definition. But given that the
>> paragraph preceding what you pasted state clearly we are not talking about
>> ACID consistency, but eventual consistency, I don't think the wiki is making
>> any unfair statement.
>>
>> That being said, the wiki may not be always as clear as it could. But it's
>> an editable wiki :)
>>
>> --
>> Sylvain
>>
>>
>>>
>>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>>> this statement on the Wiki architecture section:-
>>>
>>> -------------------------------------------------------------
>>>
>>> More specifically: R=read replica count W=write replica count N=replication
>>> factor Q=*QUORUM* (Q = N / 2 + 1)
>>>
>>>    -
>>>
>>>    If W + R > N, you will have consistency
>>>    - W=1, R=N
>>>    - W=N, R=1
>>>    - W=Q, R=Q where Q = N / 2 + 1
>>>
>>> Cassandra provides consistency when R + W > N (read replica count + write
>>> replica count > replication factor).
>>>
>>> ----------------------------------------------------
>>>
>>>
>>> .
>>>
>>>
>>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sylvain@datastax.com
>>> > wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>wrote:
>>>>
>>>>> If you are correct and you are probably closer to the code - then CL of
>>>>> Quorum does not guarantee a consistency.
>>>>
>>>>
>>>> If the operation succeed, it does (for some definition of consistency
>>>> which is, following reads at Quorum will be guaranteed to see the new value
>>>> of a update at quorum). If it fails, then no, it does not guarantee
>>>> consistency.
>>>>
>>>> It is important to note that the word consistency has multiple meaning.
>>>> In particular, when we are talking of consistency in Cassandra, we are not
>>>> talking of the same definition as the C in ACID (see:
>>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <
>>>>> sylvain@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>>>>>>
>>>>>>>  >>Time stamps are not used for conflict resolution - unless is is
>>>>>>>> part of the application logic!!!
>>>>>>>>
>>>>>>>
>>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>>>> update twice the same column (which
>>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>>> update wins (which I'll call a resolution).
>>>>>>>
>>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>>> here. And yes we are responding to the immediate questions without covering
>>>>>>> all questions in the thread.
>>>>>>>
>>>>>>> The point being made here is that the timestamp of the column is not
>>>>>>> used by Cassandra to figure out what data to return.
>>>>>>>
>>>>>>
>>>>>> Not quite true.
>>>>>>
>>>>>>
>>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>>>> returned as failed - right ?
>>>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>>>> write failed for.
>>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>>
>>>>>>> I submit it will return TS1 - the old TS.
>>>>>>>
>>>>>>
>>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>>> RF=3, that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>>>> timestamp and decide what to return based on this. If N2/N3 responds
>>>>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>>>>> will stil be TS1 that will be returned.
>>>>>> So yes timestamp is used for conflict resolution.
>>>>>>
>>>>>> In your example, you could get TS1 back because a failed write can let
>>>>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>>>>> only when it succeeds can you be guaranteed that quorum read will always
>>>>>> return TS2.
>>>>>>
>>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>>> the write did not made it in (there is no revert).
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Are we on the same page with this interpretation ?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> -JA
>>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <
>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Sylvan,
>>>>>>>>>
>>>>>>>>> Time stamps are not used for conflict resolution - unless is is
>>>>>>>>> part of the application logic!!!
>>>>>>>>>
>>>>>>>>
>>>>>>>> What is you definition of conflict resolution ? Because if you
>>>>>>>> update twice the same column (which
>>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>>> update wins (which I'll call a resolution).
>>>>>>>>
>>>>>>>>
>>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>>> provides fairly strong durability guarantee, so for some definition you
>>>>>>>> don't "lose updates".
>>>>>>>>
>>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>>>>> clocks certainly don't give you transactions).
>>>>>>>>
>>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>>
>>>>>>>>> -JA
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <
>>>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Apologies : For some reason my response on the original mail
>>>>>>>>>>> keeps bouncing back, thus this new one!
>>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>>> > "For conditional writes to work, the condition must be
>>>>>>>>>>> evaluated at all update
>>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>>>> >
>>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>>
>>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>>
>>>>>>>>>>> Questions:-
>>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> No locking, no.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>>>>> the resolution rules are the following:
>>>>>>>>>>    - If the updates have a different timestamp, keep the one with
>>>>>>>>>> the higher timestamp. That is, the more recent of two updates win.
>>>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>>>>>> a consistent manner.
>>>>>>>>>>
>>>>>>>>>> So if you do two truly concurrent updates (that is from two place
>>>>>>>>>> at the same instant), then you'll end with one of the update. This is the
>>>>>>>>>> column level.
>>>>>>>>>>
>>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>>>> locking or anything needed.
>>>>>>>>>>
>>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>>> enough. If the same user update twice it's profile picture on you web site
>>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>>>>>> much example where it is no the case.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Sylvain
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

I see the point - apologies for putting everyone through this!

It was just militating against my mental model.

In summary, here is my take away - simple stuff but - IMO - important to
conclude this thread (I hope):-
1. I was splitting hairs over a failed (partial) Quorum write. Such an event
should be followed immediately by the same write retried over a connection to
another node (potentially via the client implementation's connection cache),
or by a read at CL ALL, because the write could have partially gone through.
2. Timestamps are used to determine the latest version (correcting the
false impression I was propagating).
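
A minimal sketch of the timestamp rule as Sylvain described it earlier in this
thread (higher timestamp wins; on equal timestamps, the higher value by byte
comparison wins). The function name and the (timestamp, value) tuple layout
are assumptions made for illustration, not Cassandra's actual code, and
Python's bytes ordering only stands in for whatever byte comparison Cassandra
really performs:

```python
def reconcile(a, b):
    """Pick the winner between two versions of the same column.

    Each version is a (timestamp, value) pair, with value given as bytes.
    This layout is an illustrative assumption, not Cassandra's internals.
    """
    ts_a, val_a = a
    ts_b, val_b = b
    if ts_a != ts_b:
        # Different timestamps: the more recent update wins.
        return a if ts_a > ts_b else b
    # Equal timestamps: break the tie deterministically by comparing bytes.
    return a if val_a >= val_b else b
```

Because the rule depends only on the two versions themselves, every replica
that sees both versions picks the same winner, with no locking involved.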

Finally, the "W + R > N" statement for Quorum CL holds, but it can be broken
by a failed write, since it is then uncertain whether the new value got
written to any server. Is that a fair characterization?
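
To make the N1/N2/N3 example from this thread concrete, here is a small
Python sketch. It only models which replicas answer and which timestamp wins;
the function names and the dictionary of per-node timestamps are assumptions
for illustration and are not part of any Cassandra API:

```python
from itertools import combinations


def always_overlaps(n, w, r):
    """True if every set of w replicas shares a node with every set of r replicas."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )


def quorum_read_outcomes(timestamps, r):
    """Map each possible set of r responding replicas to the timestamp returned."""
    return {
        responders: max(timestamps[node] for node in responders)
        for responders in combinations(sorted(timestamps), r)
    }


N = 3
Q = N // 2 + 1  # integer division, so Q = 2 for N = 3

# State after the failed write: only N1 took the new timestamp (TS2 = 2),
# while N2 and N3 still hold the old one (TS1 = 1).
after_failed_write = {"N1": 2, "N2": 1, "N3": 1}
outcomes = quorum_read_outcomes(after_failed_write, Q)
```

With W=Q and R=Q the overlap check holds, which is the guarantee for a write
that succeeded. In the failed-write state, a read answered by N2 and N3 still
sees TS1, while any read that includes N1 sees TS2.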

Bottom line - unlike a traditional DBMS, an error does not trigger automatic
cleanup and rollback; app code has to follow up if immediate - and not
eventual - consistency is desired. I had made that leap in almost all cases, I
think, except the case of a failed write.
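
The follow-up in app code could look roughly like the Python sketch below.
Everything in it is hypothetical: `WriteFailed`, `write_until_acknowledged`
and the `write_fn(key, value, timestamp)` callable stand in for whatever the
client library actually provides. The one deliberate choice is that each
retry reuses the same timestamp, on the assumption that replaying the same
value with the same timestamp is harmless on replicas that already have it:

```python
import time


class WriteFailed(Exception):
    """Hypothetical error raised when a quorum write is not acknowledged."""


def write_until_acknowledged(write_fn, key, value, timestamp,
                             attempts=3, delay=0.0):
    """Retry the same write until it is acknowledged or attempts run out.

    write_fn is a hypothetical callable that raises WriteFailed on failure.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            write_fn(key, value, timestamp)
            return attempt
        except WriteFailed:
            if attempt == attempts:
                # Give up and let the caller decide (e.g. read at CL ALL).
                raise
            time.sleep(delay)
```

If all attempts fail, the state of the write is still unknown, so the caller
is back to the choice described in point 1 above.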

My bad and I can live with this!

Regards,

-JA

On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne <sy...@datastax.com> wrote:

> On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com>wrote:
>
>> Completely understand!
>>
>> All that I am quibbling over is whether a CL of quorum guarantees
>> consistency or not. That is what the documentation says - right. IF for a CL
>> of Q read - it depends on which node returns read first to determine the
>> actual returned result or other more convoluted conditions , then a Quorum
>> read/write is not consistent, by any definition.
>>
>
> But that's the point. The definition of consistency we are talking about
> has no meaning if you consider only a quorum read. The definition (which is
> the de facto definition of consistency in 'eventually consistent') make
> sense if we talk about a write followed by a read. And it is
> considering succeeding write followed by succeeding read.
> And that is the statement the wiki is making.
>
> Honestly, we could debate forever on the definition of consistency and
> whatnot. Cassandra guaranties that if you do a (succeeding) write on W
> replica and then a (succeeding) read on R replica and if R+W>N, then it is
> guaranteed that the read will see the preceding write. And this is what is
> called consistency in the context of eventual consistency (which is not the
> context of ACID).
>
> If this is not the definition of consistency you had in mind then by all
> mean, Cassandra probably don't guarantee this definition. But given that the
> paragraph preceding what you pasted state clearly we are not talking about
> ACID consistency, but eventual consistency, I don't think the wiki is making
> any unfair statement.
>
> That being said, the wiki may not be always as clear as it could. But it's
> an editable wiki :)
>
> --
> Sylvain
>
>
>>
>> I can still use Cassandra, and will use it, luv it!!! But let us not make
>> this statement on the Wiki architecture section:-
>>
>> -------------------------------------------------------------
>>
>> More specifically: R=read replica count W=write replica count N=replication
>> factor Q=*QUORUM* (Q = N / 2 + 1)
>>
>>    -
>>
>>    If W + R > N, you will have consistency
>>    - W=1, R=N
>>    - W=N, R=1
>>    - W=Q, R=Q where Q = N / 2 + 1
>>
>> Cassandra provides consistency when R + W > N (read replica count + write
>> replica count > replication factor).
>>
>> ----------------------------------------------------
>>
>>
>> .
>>
>>
>> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>
>>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>wrote:
>>>
>>>> If you are correct and you are probably closer to the code - then CL of
>>>> Quorum does not guarantee a consistency.
>>>
>>>
>>> If the operation succeed, it does (for some definition of consistency
>>> which is, following reads at Quorum will be guaranteed to see the new value
>>> of a update at quorum). If it fails, then no, it does not guarantee
>>> consistency.
>>>
>>> It is important to note that the word consistency has multiple meaning.
>>> In particular, when we are talking of consistency in Cassandra, we are not
>>> talking of the same definition as the C in ACID (see:
>>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <
>>>> sylvain@datastax.com> wrote:
>>>>
>>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>>>>>
>>>>>>  >>Time stamps are not used for conflict resolution - unless is is
>>>>>>> part of the application logic!!!
>>>>>>>
>>>>>>
>>>>>> >>What is you definition of conflict resolution ? Because if you
>>>>>> update twice the same column (which
>>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>>> update wins (which I'll call a resolution).
>>>>>>
>>>>>> I understand what you are saying, and yes semantics is very important
>>>>>> here. And yes we are responding to the immediate questions without covering
>>>>>> all questions in the thread.
>>>>>>
>>>>>> The point being made here is that the timestamp of the column is not
>>>>>> used by Cassandra to figure out what data to return.
>>>>>>
>>>>>
>>>>> Not quite true.
>>>>>
>>>>>
>>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>>> returned as failed - right ?
>>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>>> write failed for.
>>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>>
>>>>>> I submit it will return TS1 - the old TS.
>>>>>>
>>>>>
>>>>> It all depends on which (first 2) nodes respond to the read (since
>>>>> RF=3, that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>>> timestamp and decide what to return based on this. If N2/N3 responds
>>>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>>>> will stil be TS1 that will be returned.
>>>>> So yes timestamp is used for conflict resolution.
>>>>>
>>>>> In your example, you could get TS1 back because a failed write can let
>>>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>>>> only when it succeeds can you be guaranteed that quorum read will always
>>>>> return TS2.
>>>>>
>>>>> This is because when a write fails, Cassandra doesn't guarantee that
>>>>> the write did not made it in (there is no revert).
>>>>>
>>>>>
>>>>>>
>>>>>> Are we on the same page with this interpretation ?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> -JA
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>>>> sylvain@datastax.com> wrote:
>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <chirayithaj@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Sylvan,
>>>>>>>>
>>>>>>>> Time stamps are not used for conflict resolution - unless is is part
>>>>>>>> of the application logic!!!
>>>>>>>>
>>>>>>>
>>>>>>> What is you definition of conflict resolution ? Because if you update
>>>>>>> twice the same column (which
>>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>>> update wins (which I'll call a resolution).
>>>>>>>
>>>>>>>
>>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>>
>>>>>>>
>>>>>>> Then again, you'll have to define what you are calling "lost
>>>>>>> updates". Provided you use a reasonable consistency level, Cassandra
>>>>>>> provides fairly strong durability guarantee, so for some definition you
>>>>>>> don't "lose updates".
>>>>>>>
>>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>>>> clocks certainly don't give you transactions).
>>>>>>>
>>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>>
>>>>>>> --
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>>> -JA
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>>
>>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <
>>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>>>>>> bouncing back, thus this new one!
>>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>>> > "For conditional writes to work, the condition must be evaluated
>>>>>>>>>> at all update
>>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>>> >
>>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>>
>>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>>
>>>>>>>>>> Questions:-
>>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> No locking, no.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>>>> the resolution rules are the following:
>>>>>>>>>    - If the updates have a different timestamp, keep the one with
>>>>>>>>> the higher timestamp. That is, the more recent of two updates win.
>>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>>>>> a consistent manner.
>>>>>>>>>
>>>>>>>>> So if you do two truly concurrent updates (that is from two place
>>>>>>>>> at the same instant), then you'll end with one of the update. This is the
>>>>>>>>> column level.
>>>>>>>>>
>>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>>> locking or anything needed.
>>>>>>>>>
>>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>>> enough. If the same user update twice it's profile picture on you web site
>>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>>>>> much example where it is no the case.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sylvain
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Sylvain Lebresne <sy...@datastax.com>.

On Thu, Feb 24, 2011 at 6:33 PM, Anthony John <ch...@gmail.com> wrote:

> Completely understand!
>
> All that I am quibbling over is whether a CL of quorum guarantees
> consistency or not. That is what the documentation says - right. IF for a CL
> of Q read - it depends on which node returns read first to determine the
> actual returned result or other more convoluted conditions , then a Quorum
> read/write is not consistent, by any definition.
>

But that's the point. The definition of consistency we are talking about has
no meaning if you consider only a quorum read. The definition (which is the
de facto definition of consistency in 'eventually consistent') only makes
sense for a write followed by a read, and more precisely for a successful
write followed by a successful read. That is the statement the wiki is making.

Honestly, we could debate forever on the definition of consistency and
whatnot. Cassandra guarantees that if you do a (successful) write on W
replicas and then a (successful) read on R replicas, and if R+W>N, then the
read is guaranteed to see the preceding write. This is what is called
consistency in the context of eventual consistency (which is not the
context of ACID).
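
To make the R+W>N condition concrete, here is a minimal sketch that
brute-forces every possible choice of write and read replicas and checks
that they always share at least one node. The function names (quorum,
always_overlap) are made up for this illustration and are not part of
Cassandra; the only thing taken from the thread is the formula
Q = N / 2 + 1 (integer division).

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Quorum size for replication factor n: Q = N / 2 + 1 (integer division)."""
    return n // 2 + 1


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares a node with every set of r read replicas."""
    nodes = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(nodes, w)
        for read in combinations(nodes, r)
    )


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # Quorum writes plus quorum reads: R + W > N, so some replica in the
    # read set has always seen the write.
    print("W=Q, R=Q:", always_overlap(n, q, q))
    # W=1, R=1: R + W <= N, so the read may miss the write entirely.
    print("W=1, R=1:", always_overlap(n, 1, 1))
```

The overlapping replica holds the newest timestamp, which is why a
successful read after a successful write sees that write. The sketch says
nothing about what happens when the write fails.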

If this is not the definition of consistency you had in mind, then by all
means, Cassandra probably doesn't guarantee your definition. But given that the
paragraph preceding what you pasted states clearly that we are talking about
eventual consistency and not ACID consistency, I don't think the wiki is making
any unfair statement.

That being said, the wiki may not always be as clear as it could be. But it's
an editable wiki :)

--
Sylvain


>
> I can still use Cassandra, and will use it, luv it!!! But let us not make
> this statement on the Wiki architecture section:-
>
> -------------------------------------------------------------
>
> More specifically: R=read replica count W=write replica count N=replication
> factor Q=*QUORUM* (Q = N / 2 + 1)
>
>    -
>
>    If W + R > N, you will have consistency
>    - W=1, R=N
>    - W=N, R=1
>    - W=Q, R=Q where Q = N / 2 + 1
>
> Cassandra provides consistency when R + W > N (read replica count + write
> replica count > replication factor).
>
> ----------------------------------------------------
>
>
> .
>
>
> On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>wrote:
>>
>>> If you are correct and you are probably closer to the code - then CL of
>>> Quorum does not guarantee a consistency.
>>
>>
>> If the operation succeed, it does (for some definition of consistency
>> which is, following reads at Quorum will be guaranteed to see the new value
>> of a update at quorum). If it fails, then no, it does not guarantee
>> consistency.
>>
>> It is important to note that the word consistency has multiple meaning. In
>> particular, when we are talking of consistency in Cassandra, we are not
>> talking of the same definition as the C in ACID (see:
>> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>>
>>>
>>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <sylvain@datastax.com
>>> > wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>>>>
>>>>>  >>Time stamps are not used for conflict resolution - unless is is
>>>>>> part of the application logic!!!
>>>>>>
>>>>>
>>>>> >>What is you definition of conflict resolution ? Because if you update
>>>>> twice the same column (which
>>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>>> update wins (which I'll call a resolution).
>>>>>
>>>>> I understand what you are saying, and yes semantics is very important
>>>>> here. And yes we are responding to the immediate questions without covering
>>>>> all questions in the thread.
>>>>>
>>>>> The point being made here is that the timestamp of the column is not
>>>>> used by Cassandra to figure out what data to return.
>>>>>
>>>>
>>>> Not quite true.
>>>>
>>>>
>>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>>> returned as failed - right ?
>>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>>> write failed for.
>>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>>
>>>>> I submit it will return TS1 - the old TS.
>>>>>
>>>>
>>>> It all depends on which (first 2) nodes respond to the read (since RF=3,
>>>> that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>>> quorum, then TS2 will be returned, because cassandra will compare the
>>>> timestamp and decide what to return based on this. If N2/N3 responds
>>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>>> will stil be TS1 that will be returned.
>>>> So yes timestamp is used for conflict resolution.
>>>>
>>>> In your example, you could get TS1 back because a failed write can let
>>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>>> only when it succeeds can you be guaranteed that quorum read will always
>>>> return TS2.
>>>>
>>>> This is because when a write fails, Cassandra doesn't guarantee that the
>>>> write did not made it in (there is no revert).
>>>>
>>>>
>>>>>
>>>>> Are we on the same page with this interpretation ?
>>>>>
>>>>> Regards,
>>>>>
>>>>> -JA
>>>>>
>>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>>> sylvain@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>>>>>
>>>>>>> Sylvan,
>>>>>>>
>>>>>>> Time stamps are not used for conflict resolution - unless is is part
>>>>>>> of the application logic!!!
>>>>>>>
>>>>>>
>>>>>> What is you definition of conflict resolution ? Because if you update
>>>>>> twice the same column (which
>>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>>> update wins (which I'll call a resolution).
>>>>>>
>>>>>>
>>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>>
>>>>>>
>>>>>> Then again, you'll have to define what you are calling "lost updates".
>>>>>> Provided you use a reasonable consistency level, Cassandra provides fairly
>>>>>> strong durability guarantee, so for some definition you don't "lose
>>>>>> updates".
>>>>>>
>>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>>> clocks certainly don't give you transactions).
>>>>>>
>>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why
>>>>>> so far I don't think vector clocks would really provide much for Cassandra.
>>>>>>
>>>>>> --
>>>>>> Sylvain
>>>>>>
>>>>>>
>>>>>>> -JA
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>>> sylvain@datastax.com> wrote:
>>>>>>>
>>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <
>>>>>>>> chirayithaj@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>>>>> bouncing back, thus this new one!
>>>>>>>>> > From the other hand, the same article says:
>>>>>>>>> > "For conditional writes to work, the condition must be evaluated
>>>>>>>>> at all update
>>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>>> >
>>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>>
>>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>>
>>>>>>>>> Questions:-
>>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> No locking, no.
>>>>>>>>
>>>>>>>>
>>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>>> the resolution rules are the following:
>>>>>>>>    - If the updates have a different timestamp, keep the one with
>>>>>>>> the higher timestamp. That is, the more recent of two updates win.
>>>>>>>>   - It the timestamps are the same, then it compares the values
>>>>>>>> (byte comparison) and keep the highest value. This is just to break ties in
>>>>>>>> a consistent manner.
>>>>>>>>
>>>>>>>> So if you do two truly concurrent updates (that is from two place at
>>>>>>>> the same instant), then you'll end with one of the update. This is the
>>>>>>>> column level.
>>>>>>>>
>>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>>> locking or anything needed.
>>>>>>>>
>>>>>>>> In my experience, for most things the timestamp resolution is
>>>>>>>> enough. If the same user update twice it's profile picture on you web site
>>>>>>>> at the same microsecond, it's usually fine to end up with one of the two
>>>>>>>> pictures. In the rare case where you need something more specific, using the
>>>>>>>> cassandra data model usually solves the problem easily. The reason for not
>>>>>>>> having vector clocks in Cassandra is that so far, we haven't really found
>>>>>>>> much example where it is no the case.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Sylvain
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

Completely understand!

All that I am quibbling over is whether a CL of quorum guarantees
consistency or not. That is what the documentation says, right? If the result
of a CL Quorum read depends on which node responds first, or on other more
convoluted conditions, then a Quorum read/write is not consistent, by any
definition.

I can still use Cassandra, and will use it, luv it!!! But let us not make
this statement on the Wiki architecture section:-

-------------------------------------------------------------

More specifically: R = read replica count, W = write replica count,
N = replication factor, Q = *QUORUM* (Q = N / 2 + 1)

   If W + R > N, you will have consistency:
   - W=1, R=N
   - W=N, R=1
   - W=Q, R=Q where Q = N / 2 + 1

Cassandra provides consistency when R + W > N (read replica count + write
replica count > replication factor).

----------------------------------------------------


.

On Thu, Feb 24, 2011 at 11:22 AM, Sylvain Lebresne <sy...@datastax.com> wrote:

> On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com>wrote:
>
>> If you are correct and you are probably closer to the code - then CL of
>> Quorum does not guarantee a consistency.
>
>
> If the operation succeed, it does (for some definition of consistency which
> is, following reads at Quorum will be guaranteed to see the new value of a
> update at quorum). If it fails, then no, it does not guarantee consistency.
>
> It is important to note that the word consistency has multiple meaning. In
> particular, when we are talking of consistency in Cassandra, we are not
> talking of the same definition as the C in ACID (see:
> http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)
>
>>
>> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>
>>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>>>
>>>>  >>Time stamps are not used for conflict resolution - unless is is part
>>>>> of the application logic!!!
>>>>>
>>>>
>>>> >>What is you definition of conflict resolution ? Because if you update
>>>> twice the same column (which
>>>> >>I'll call a conflict), then the timestamps are used to decide which
>>>> update wins (which I'll call a resolution).
>>>>
>>>> I understand what you are saying, and yes semantics is very important
>>>> here. And yes we are responding to the immediate questions without covering
>>>> all questions in the thread.
>>>>
>>>> The point being made here is that the timestamp of the column is not
>>>> used by Cassandra to figure out what data to return.
>>>>
>>>
>>> Not quite true.
>>>
>>>
>>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>>> returned as failed - right ?
>>>> Now Quorum read comes in for exactly the same piece of data that the
>>>> write failed for.
>>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>>> And the read succeeds - Will it return TS1 or TS2.
>>>>
>>>> I submit it will return TS1 - the old TS.
>>>>
>>>
>>> It all depends on which (first 2) nodes respond to the read (since RF=3,
>>> that can any two of N1/N2/N3). If N1 is part of the two that makes the
>>> quorum, then TS2 will be returned, because cassandra will compare the
>>> timestamp and decide what to return based on this. If N2/N3 responds
>>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>>> will stil be TS1 that will be returned.
>>> So yes timestamp is used for conflict resolution.
>>>
>>> In your example, you could get TS1 back because a failed write can let
>>> you cluster in an inconsistent state. You'd have to retry the quorum and
>>> only when it succeeds can you be guaranteed that quorum read will always
>>> return TS2.
>>>
>>> This is because when a write fails, Cassandra doesn't guarantee that the
>>> write did not made it in (there is no revert).
>>>
>>>
>>>>
>>>> Are we on the same page with this interpretation ?
>>>>
>>>> Regards,
>>>>
>>>> -JA
>>>>
>>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <
>>>> sylvain@datastax.com> wrote:
>>>>
>>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>>>>
>>>>>> Sylvan,
>>>>>>
>>>>>> Time stamps are not used for conflict resolution - unless is is part
>>>>>> of the application logic!!!
>>>>>>
>>>>>
>>>>> What is you definition of conflict resolution ? Because if you update
>>>>> twice the same column (which
>>>>> I'll call a conflict), then the timestamps are used to decide which
>>>>> update wins (which I'll call a resolution).
>>>>>
>>>>>
>>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>>
>>>>>
>>>>> Then again, you'll have to define what you are calling "lost updates".
>>>>> Provided you use a reasonable consistency level, Cassandra provides fairly
>>>>> strong durability guarantee, so for some definition you don't "lose
>>>>> updates".
>>>>>
>>>>> That being said, I never pretended that Cassandra provided any ACID
>>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>>> we're talking about the guarantees of transaction, then by all means,
>>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>>> clocks certainly don't give you transactions).
>>>>>
>>>>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>>>>> far I don't think vector clocks would really provide much for Cassandra.
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>>
>>>>>> -JA
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>>> sylvain@datastax.com> wrote:
>>>>>>
>>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <chirayithaj@gmail.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>>>> bouncing back, thus this new one!
>>>>>>>> > From the other hand, the same article says:
>>>>>>>> > "For conditional writes to work, the condition must be evaluated
>>>>>>>> at all update
>>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>>> >
>>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>>
>>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>>
>>>>>>>> Questions:-
>>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>>
>>>>>>>
>>>>>>> No locking, no.
>>>>>>>
>>>>>>>
>>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>>> nodes can still mess each other up, right ?
>>>>>>>>
>>>>>>>
>>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>>> the resolution rules are the following:
>>>>>>>    - If the updates have a different timestamp, keep the one with the
>>>>>>> higher timestamp. That is, the more recent of two updates win.
>>>>>>>   - It the timestamps are the same, then it compares the values (byte
>>>>>>> comparison) and keep the highest value. This is just to break ties in a
>>>>>>> consistent manner.
>>>>>>>
>>>>>>> So if you do two truly concurrent updates (that is from two place at
>>>>>>> the same instant), then you'll end with one of the update. This is the
>>>>>>> column level.
>>>>>>>
>>>>>>> However, if that simple conflict detection/resolution mechanism is
>>>>>>> not good enough for some of your use case and you need to keep two
>>>>>>> concurrent updates, it is easy enough. Just make sure that the update don't
>>>>>>> end up in the same column. This is easily achieved by appending some unique
>>>>>>> identifier to the column name for instance. And when reading, do a slice and
>>>>>>> reconcile whatever you get back with whatever logic make sense. If you do
>>>>>>> that, congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>>> locking or anything needed.
>>>>>>>
>>>>>>> In my experience, for most things the timestamp resolution is enough.
>>>>>>> If the same user update twice it's profile picture on you web site at the
>>>>>>> same microsecond, it's usually fine to end up with one of the two pictures.
>>>>>>> In the rare case where you need something more specific, using the cassandra
>>>>>>> data model usually solves the problem easily. The reason for not having
>>>>>>> vector clocks in Cassandra is that so far, we haven't really found much
>>>>>>> example where it is no the case.
>>>>>>>
>>>>>>> --
>>>>>>> Sylvain
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Sylvain Lebresne <sy...@datastax.com>.

On Thu, Feb 24, 2011 at 6:01 PM, Anthony John <ch...@gmail.com> wrote:

> If you are correct and you are probably closer to the code - then CL of
> Quorum does not guarantee a consistency.


If the operation succeeds, it does (for the definition of consistency under
which reads at Quorum that follow a successful update at Quorum are guaranteed
to see the new value). If it fails, then no, it does not guarantee consistency.
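
The failed-write case from the example quoted below (RF=3, the write with
TS2 lands on N1 only) can be sketched as follows. This is a toy model, not
Cassandra code: the replicas dict, reconcile and quorum_read are
hypothetical names, the values b"old"/b"new" are invented, and read repair
is deliberately left out. The reconcile rule is the one described earlier
in the thread: higher timestamp wins, and on equal timestamps the higher
value by byte comparison wins.

```python
from itertools import combinations

TS1, TS2 = 1, 2

# State after the failed quorum write: only N1 applied the TS2 update.
replicas = {
    "N1": (TS2, b"new"),
    "N2": (TS1, b"old"),
    "N3": (TS1, b"old"),
}


def reconcile(a, b):
    """Pick the winner of two (timestamp, value) versions of one column."""
    # Tuple comparison checks the timestamp first, then the value bytes,
    # so ties on timestamp are broken by byte comparison of the values.
    return max(a, b)


def quorum_read(responders):
    """Reconcile the versions returned by the replicas that answered the read."""
    versions = [replicas[node] for node in responders]
    result = versions[0]
    for version in versions[1:]:
        result = reconcile(result, version)
    return result


if __name__ == "__main__":
    # Any two of the three replicas can form the read quorum.
    for pair in combinations(sorted(replicas), 2):
        print(pair, quorum_read(pair))
```

In this model, a read answered by N1 and either other node returns the TS2
value, while a read answered by N2 and N3 returns the TS1 value. That is
the inconsistent state a failed write can leave behind until the write is
retried and succeeds at quorum.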

It is important to note that the word consistency has multiple meanings. In
particular, when we talk about consistency in Cassandra, we are not
using the same definition as the C in ACID (see:
http://www.allthingsdistributed.com/2007/12/eventually_consistent.html)

>
> On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>>
>>>  >>Time stamps are not used for conflict resolution - unless is is part
>>>> of the application logic!!!
>>>>
>>>
>>> >>What is you definition of conflict resolution ? Because if you update
>>> twice the same column (which
>>> >>I'll call a conflict), then the timestamps are used to decide which
>>> update wins (which I'll call a resolution).
>>>
>>> I understand what you are saying, and yes semantics is very important
>>> here. And yes we are responding to the immediate questions without covering
>>> all questions in the thread.
>>>
>>> The point being made here is that the timestamp of the column is not used
>>> by Cassandra to figure out what data to return.
>>>
>>
>> Not quite true.
>>
>>
>>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>>> A Quorum  Write comes and add/updates the time stamp (TS2) of a
>>> particular data element. It succeeds on N1 - fails on N2/3. So the write is
>>> returned as failed - right ?
>>> Now Quorum read comes in for exactly the same piece of data that the
>>> write failed for.
>>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>>> And the read succeeds - Will it return TS1 or TS2.
>>>
>>> I submit it will return TS1 - the old TS.
>>>
>>
>> It all depends on which (first 2) nodes respond to the read (since RF=3,
>> that can any two of N1/N2/N3). If N1 is part of the two that makes the
>> quorum, then TS2 will be returned, because cassandra will compare the
>> timestamp and decide what to return based on this. If N2/N3 responds
>> however, both timestamp will be TS1 and so, after timestamp resolution, it
>> will stil be TS1 that will be returned.
>> So yes timestamp is used for conflict resolution.
>>
>> In your example, you could get TS1 back because a failed write can let you
>> cluster in an inconsistent state. You'd have to retry the quorum and only
>> when it succeeds can you be guaranteed that quorum read will always return
>> TS2.
>>
>> This is because when a write fails, Cassandra doesn't guarantee that the
>> write did not made it in (there is no revert).
>>
>>
>>>
>>> Are we on the same page with this interpretation ?
>>>
>>> Regards,
>>>
>>> -JA
>>>
>>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sylvain@datastax.com
>>> > wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>>>
>>>>> Sylvan,
>>>>>
>>>>> Time stamps are not used for conflict resolution - unless is is part of
>>>>> the application logic!!!
>>>>>
>>>>
>>>> What is you definition of conflict resolution ? Because if you update
>>>> twice the same column (which
>>>> I'll call a conflict), then the timestamps are used to decide which
>>>> update wins (which I'll call a resolution).
>>>>
>>>>
>>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd
>>>>> products - cages for e.g. - to get ACID type consistency.
>>>>>
>>>>
>>>> Then again, you'll have to define what you are calling "lost updates".
>>>> Provided you use a reasonable consistency level, Cassandra provides fairly
>>>> strong durability guarantee, so for some definition you don't "lose
>>>> updates".
>>>>
>>>> That being said, I never pretended that Cassandra provided any ACID
>>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>>> we're talking about the guarantees of transaction, then by all means,
>>>> cassandra won't provide it. And yes you can use cages or the like to get
>>>> transaction. But that was not the point of the thread, was it ? The thread
>>>> is about vector clocks, and that has nothing to do with transaction (vector
>>>> clocks certainly don't give you transactions).
>>>>
>>>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>>>> far I don't think vector clocks would really provide much for Cassandra.
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>>
>>>>> -JA
>>>>>
>>>>>
>>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <
>>>>> sylvain@datastax.com> wrote:
>>>>>
>>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com>wrote:
>>>>>>
>>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>>> bouncing back, thus this new one!
>>>>>>> > From the other hand, the same article says:
>>>>>>> > "For conditional writes to work, the condition must be evaluated at
>>>>>>> all update
>>>>>>> > sites before the write can be allowed to succeed."
>>>>>>> >
>>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>>
>>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>>
>>>>>>> Questions:-
>>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>>> granularity whether it be row/colF/Col ?
>>>>>>>
>>>>>>
>>>>>> No locking, no.
>>>>>>
>>>>>>
>>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>>> nodes can still mess each other up, right ?
>>>>>>>
>>>>>>
>>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>>> updating the same piece of data means the same column value. In that case,
>>>>>> the resolution rules are the following:
>>>>>>    - If the updates have a different timestamp, keep the one with the
>>>>>> higher timestamp. That is, the more recent of two updates win.
>>>>>>   - It the timestamps are the same, then it compares the values (byte
>>>>>> comparison) and keep the highest value. This is just to break ties in a
>>>>>> consistent manner.
>>>>>>
>>>>>> So if you do two truly concurrent updates (that is from two place at
>>>>>> the same instant), then you'll end with one of the update. This is the
>>>>>> column level.
>>>>>>
>>>>>> However, if that simple conflict detection/resolution mechanism is not
>>>>>> good enough for some of your use case and you need to keep two concurrent
>>>>>> updates, it is easy enough. Just make sure that the update don't end up in
>>>>>> the same column. This is easily achieved by appending some unique identifier
>>>>>> to the column name for instance. And when reading, do a slice and reconcile
>>>>>> whatever you get back with whatever logic make sense. If you do that,
>>>>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>>> locking or anything needed.
>>>>>>
>>>>>> In my experience, for most things the timestamp resolution is enough.
>>>>>> If the same user update twice it's profile picture on you web site at the
>>>>>> same microsecond, it's usually fine to end up with one of the two pictures.
>>>>>> In the rare case where you need something more specific, using the cassandra
>>>>>> data model usually solves the problem easily. The reason for not having
>>>>>> vector clocks in Cassandra is that so far, we haven't really found much
>>>>>> example where it is no the case.
>>>>>>
>>>>>> --
>>>>>> Sylvain
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

If you are correct, and you are probably closer to the code, then a CL of
Quorum does not guarantee consistency.

On Thu, Feb 24, 2011 at 10:54 AM, Sylvain Lebresne <sy...@datastax.com> wrote:

> On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com>wrote:
>
>>  >>Time stamps are not used for conflict resolution - unless is is part
>>> of the application logic!!!
>>>
>>
>> >>What is you definition of conflict resolution ? Because if you update
>> twice the same column (which
>> >>I'll call a conflict), then the timestamps are used to decide which
>> update wins (which I'll call a resolution).
>>
>> I understand what you are saying, and yes semantics is very important
>> here. And yes we are responding to the immediate questions without covering
>> all questions in the thread.
>>
>> The point being made here is that the timestamp of the column is not used
>> by Cassandra to figure out what data to return.
>>
>
> Not quite true.
>
>
>> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
>> A Quorum  Write comes and add/updates the time stamp (TS2) of a particular
>> data element. It succeeds on N1 - fails on N2/3. So the write is returned as
>> failed - right ?
>> Now Quorum read comes in for exactly the same piece of data that the write
>> failed for.
>> So N1 has TS2 but both N2/3 have the old TS (say TS1)
>> And the read succeeds - Will it return TS1 or TS2.
>>
>> I submit it will return TS1 - the old TS.
>>
>
> It all depends on which (first 2) nodes respond to the read (since RF=3,
> that can be any two of N1/N2/N3). If N1 is part of the two that make up the
> quorum, then TS2 will be returned, because Cassandra will compare the
> timestamps and decide what to return based on this. If N2/N3 respond
> however, both timestamps will be TS1 and so, after timestamp resolution, it
> will still be TS1 that will be returned.
> So yes, timestamps are used for conflict resolution.
>
> In your example, you could get TS1 back because a failed write can leave your
> cluster in an inconsistent state. You'd have to retry the quorum write and only
> when it succeeds can you be guaranteed that a quorum read will always return
> TS2.
>
> This is because when a write fails, Cassandra doesn't guarantee that the
> write did not make it in (there is no revert).
>
>
>>
>> Are we on the same page with this interpretation ?
>>
>> Regards,
>>
>> -JA
>>
>> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>
>>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>>
>>>> Sylvan,
>>>>
>>>> Time stamps are not used for conflict resolution - unless is is part of
>>>> the application logic!!!
>>>>
>>>
>>> What is you definition of conflict resolution ? Because if you update
>>> twice the same column (which
>>> I'll call a conflict), then the timestamps are used to decide which
>>> update wins (which I'll call a resolution).
>>>
>>>
>>>> You can have "lost updates" w/Cassandra. You need to to use 3rd products
>>>> - cages for e.g. - to get ACID type consistency.
>>>>
>>>
>>> Then again, you'll have to define what you are calling "lost updates".
>>> Provided you use a reasonable consistency level, Cassandra provides fairly
>>> strong durability guarantee, so for some definition you don't "lose
>>> updates".
>>>
>>> That being said, I never pretended that Cassandra provided any ACID
>>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>>> we're talking about the guarantees of transaction, then by all means,
>>> cassandra won't provide it. And yes you can use cages or the like to get
>>> transaction. But that was not the point of the thread, was it ? The thread
>>> is about vector clocks, and that has nothing to do with transaction (vector
>>> clocks certainly don't give you transactions).
>>>
>>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>>> far I don't think vector clocks would really provide much for Cassandra.
>>>
>>> --
>>> Sylvain
>>>
>>>
>>>> -JA
>>>>
>>>>
>>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <sylvain@datastax.com
>>>> > wrote:
>>>>
>>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com>wrote:
>>>>>
>>>>>> Apologies : For some reason my response on the original mail keeps
>>>>>> bouncing back, thus this new one!
>>>>>> > From the other hand, the same article says:
>>>>>> > "For conditional writes to work, the condition must be evaluated at
>>>>>> all update
>>>>>> > sites before the write can be allowed to succeed."
>>>>>> >
>>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>>
>>>>>> Sorry, but I am confused by that entire thread!
>>>>>>
>>>>>> Questions:-
>>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>>> granularity whether it be row/colF/Col ?
>>>>>>
>>>>>
>>>>> No locking, no.
>>>>>
>>>>>
>>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent
>>>>>> conflicts. Concurrent updates on exactly the same piece of data on different
>>>>>> nodes can still mess each other up, right ?
>>>>>>
>>>>>
>>>>> Not sure why you are taking CL.ALL specifically. But in any CL,
>>>>> updating the same piece of data means the same column value. In that case,
>>>>> the resolution rules are the following:
>>>>>    - If the updates have a different timestamp, keep the one with the
>>>>> higher timestamp. That is, the more recent of two updates win.
>>>>>   - If the timestamps are the same, then it compares the values (byte
>>>>> comparison) and keep the highest value. This is just to break ties in a
>>>>> consistent manner.
>>>>>
>>>>> So if you do two truly concurrent updates (that is from two place at
>>>>> the same instant), then you'll end with one of the update. This is the
>>>>> column level.
>>>>>
>>>>> However, if that simple conflict detection/resolution mechanism is not
>>>>> good enough for some of your use case and you need to keep two concurrent
>>>>> updates, it is easy enough. Just make sure that the update don't end up in
>>>>> the same column. This is easily achieved by appending some unique identifier
>>>>> to the column name for instance. And when reading, do a slice and reconcile
>>>>> whatever you get back with whatever logic make sense. If you do that,
>>>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>>> locking or anything needed.
>>>>>
>>>>> In my experience, for most things the timestamp resolution is enough.
>>>>> If the same user update twice it's profile picture on you web site at the
>>>>> same microsecond, it's usually fine to end up with one of the two pictures.
>>>>> In the rare case where you need something more specific, using the cassandra
>>>>> data model usually solves the problem easily. The reason for not having
>>>>> vector clocks in Cassandra is that so far, we haven't really found much
>>>>> example where it is no the case.
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Sylvain Lebresne <sy...@datastax.com>.

On Thu, Feb 24, 2011 at 5:34 PM, Anthony John <ch...@gmail.com> wrote:

> >>Time stamps are not used for conflict resolution - unless is is part of
>> the application logic!!!
>>
>
> >>What is you definition of conflict resolution ? Because if you update
> twice the same column (which
> >>I'll call a conflict), then the timestamps are used to decide which
> update wins (which I'll call a resolution).
>
> I understand what you are saying, and yes semantics is very important here.
> And yes we are responding to the immediate questions without covering all
> questions in the thread.
>
> The point being made here is that the timestamp of the column is not used
> by Cassandra to figure out what data to return.
>

Not quite true.


> E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
> A Quorum  Write comes and add/updates the time stamp (TS2) of a particular
> data element. It succeeds on N1 - fails on N2/3. So the write is returned as
> failed - right ?
> Now Quorum read comes in for exactly the same piece of data that the write
> failed for.
> So N1 has TS2 but both N2/3 have the old TS (say TS1)
> And the read succeeds - Will it return TS1 or TS2.
>
> I submit it will return TS1 - the old TS.
>

It all depends on which (first 2) nodes respond to the read (since RF=3,
that can be any two of N1/N2/N3). If N1 is part of the two that make up the
quorum, then TS2 will be returned, because Cassandra will compare the
timestamps and decide what to return based on this. If N2/N3 respond
however, both timestamps will be TS1 and so, after timestamp resolution, it
will still be TS1 that will be returned.
So yes, timestamps are used for conflict resolution.

In your example, you could get TS1 back because a failed write can leave your
cluster in an inconsistent state. You'd have to retry the quorum write and only
when it succeeds can you be guaranteed that a quorum read will always return
TS2.

This is because when a write fails, Cassandra doesn't guarantee that the
write did not make it in (there is no revert).
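
To make the scenario above concrete, here is a toy model in Python. It is
only a sketch, not Cassandra code: the replica names, the (value, timestamp)
tuples and the quorum_read helper are made up for illustration, and it
ignores read repair, hinted handoff and everything else a real cluster does.

```python
from itertools import combinations

# State after the failed quorum write: only N1 took the new value (TS2).
# Each replica holds a (value, timestamp) pair for the same column.
replicas = {
    "N1": ("new", 2),
    "N2": ("old", 1),
    "N3": ("old", 1),
}

def quorum_read(replicas, responders):
    """Return the winning (value, timestamp) among the responding replicas.

    The highest timestamp wins; the value is only used to break ties.
    """
    answers = [replicas[node] for node in responders]
    return max(answers, key=lambda col: (col[1], col[0]))

# Enumerate every pair of replicas that could form the read quorum.
for pair in combinations(sorted(replicas), 2):
    print(pair, "->", quorum_read(replicas, pair))
```

Any quorum that includes N1 returns the TS2 value; only the N2/N3 pair
returns the TS1 value, which is why the result depends on who answers first.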


>
> Are we on the same page with this interpretation ?
>
> Regards,
>
> -JA
>
> On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>>
>>> Sylvan,
>>>
>>> Time stamps are not used for conflict resolution - unless is is part of
>>> the application logic!!!
>>>
>>
>> What is you definition of conflict resolution ? Because if you update
>> twice the same column (which
>> I'll call a conflict), then the timestamps are used to decide which update
>> wins (which I'll call a resolution).
>>
>>
>>> You can have "lost updates" w/Cassandra. You need to to use 3rd products
>>> - cages for e.g. - to get ACID type consistency.
>>>
>>
>> Then again, you'll have to define what you are calling "lost updates".
>> Provided you use a reasonable consistency level, Cassandra provides fairly
>> strong durability guarantee, so for some definition you don't "lose
>> updates".
>>
>> That being said, I never pretended that Cassandra provided any ACID
>> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
>> we're talking about the guarantees of transaction, then by all means,
>> cassandra won't provide it. And yes you can use cages or the like to get
>> transaction. But that was not the point of the thread, was it ? The thread
>> is about vector clocks, and that has nothing to do with transaction (vector
>> clocks certainly don't give you transactions).
>>
>> Sorry if I wasn't clear in my mail, but I was only responding to why so
>> far I don't think vector clocks would really provide much for Cassandra.
>>
>> --
>> Sylvain
>>
>>
>>> -JA
>>>
>>>
>>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>>
>>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com>wrote:
>>>>
>>>>> Apologies : For some reason my response on the original mail keeps
>>>>> bouncing back, thus this new one!
>>>>> > From the other hand, the same article says:
>>>>> > "For conditional writes to work, the condition must be evaluated at
>>>>> all update
>>>>> > sites before the write can be allowed to succeed."
>>>>> >
>>>>> > This means, that when doing such an update CL=ALL must be used
>>>>>
>>>>> Sorry, but I am confused by that entire thread!
>>>>>
>>>>> Questions:-
>>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>>> granularity whether it be row/colF/Col ?
>>>>>
>>>>
>>>> No locking, no.
>>>>
>>>>
>>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
>>>>> Concurrent updates on exactly the same piece of data on different nodes can
>>>>> still mess each other up, right ?
>>>>>
>>>>
>>>> Not sure why you are taking CL.ALL specifically. But in any CL, updating
>>>> the same piece of data means the same column value. In that case, the
>>>> resolution rules are the following:
>>>>    - If the updates have a different timestamp, keep the one with the
>>>> higher timestamp. That is, the more recent of two updates win.
>>>>   - If the timestamps are the same, then it compares the values (byte
>>>> comparison) and keep the highest value. This is just to break ties in a
>>>> consistent manner.
>>>>
>>>> So if you do two truly concurrent updates (that is from two place at the
>>>> same instant), then you'll end with one of the update. This is the column
>>>> level.
>>>>
>>>> However, if that simple conflict detection/resolution mechanism is not
>>>> good enough for some of your use case and you need to keep two concurrent
>>>> updates, it is easy enough. Just make sure that the update don't end up in
>>>> the same column. This is easily achieved by appending some unique identifier
>>>> to the column name for instance. And when reading, do a slice and reconcile
>>>> whatever you get back with whatever logic make sense. If you do that,
>>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>>> locking or anything needed.
>>>>
>>>> In my experience, for most things the timestamp resolution is enough. If
>>>> the same user update twice it's profile picture on you web site at the same
>>>> microsecond, it's usually fine to end up with one of the two pictures. In
>>>> the rare case where you need something more specific, using the cassandra
>>>> data model usually solves the problem easily. The reason for not having
>>>> vector clocks in Cassandra is that so far, we haven't really found much
>>>> example where it is no the case.
>>>>
>>>> --
>>>> Sylvain
>>>>
>>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

>
> >>Time stamps are not used for conflict resolution - unless it is part of
> the application logic!!!
>

>>What is your definition of conflict resolution ? Because if you update
twice the same column (which
>>I'll call a conflict), then the timestamps are used to decide which update
wins (which I'll call a resolution).

I understand what you are saying, and yes semantics is very important here.
And yes we are responding to the immediate questions without covering all
questions in the thread.

The point being made here is that the timestamp of the column is not used by
Cassandra to figure out what data to return.

E.g. - Quorum is 2 nodes - and RF of 3 over N1/2/3
A Quorum write comes in and adds/updates the timestamp (TS2) of a particular
data element. It succeeds on N1 - fails on N2/3. So the write is returned as
failed - right ?
Now a Quorum read comes in for exactly the same piece of data that the write
failed for.
So N1 has TS2 but both N2/3 have the old TS (say TS1)
And the read succeeds - will it return TS1 or TS2?

I submit it will return TS1 - the old TS.

Are we on the same page with this interpretation ?

Regards,

-JA

On Thu, Feb 24, 2011 at 10:12 AM, Sylvain Lebresne <sy...@datastax.com>wrote:

> On Thu, Feb 24, 2011 at 4:52 PM, Anthony John <ch...@gmail.com>wrote:
>
>> Sylvan,
>>
>> Time stamps are not used for conflict resolution - unless is is part of
>> the application logic!!!
>>
>
> What is you definition of conflict resolution ? Because if you update twice
> the same column (which
> I'll call a conflict), then the timestamps are used to decide which update
> wins (which I'll call a resolution).
>
>
>> You can have "lost updates" w/Cassandra. You need to to use 3rd products -
>> cages for e.g. - to get ACID type consistency.
>>
>
> Then again, you'll have to define what you are calling "lost updates".
> Provided you use a reasonable consistency level, Cassandra provides fairly
> strong durability guarantee, so for some definition you don't "lose
> updates".
>
> That being said, I never pretended that Cassandra provided any ACID
> guarantee. ACID relates to transaction, which Cassandra doesn't support. If
> we're talking about the guarantees of transaction, then by all means,
> cassandra won't provide it. And yes you can use cages or the like to get
> transaction. But that was not the point of the thread, was it ? The thread
> is about vector clocks, and that has nothing to do with transaction (vector
> clocks certainly don't give you transactions).
>
> Sorry if I wasn't clear in my mail, but I was only responding to why so far
> I don't think vector clocks would really provide much for Cassandra.
>
> --
> Sylvain
>
>
>> -JA
>>
>>
>> On Thu, Feb 24, 2011 at 7:41 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>>
>>> On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com>wrote:
>>>
>>>> Apologies : For some reason my response on the original mail keeps
>>>> bouncing back, thus this new one!
>>>> > From the other hand, the same article says:
>>>> > "For conditional writes to work, the condition must be evaluated at
>>>> all update
>>>> > sites before the write can be allowed to succeed."
>>>> >
>>>> > This means, that when doing such an update CL=ALL must be used
>>>>
>>>> Sorry, but I am confused by that entire thread!
>>>>
>>>> Questions:-
>>>> 1. Does Cassandra implement any kind of data locking - at any
>>>> granularity whether it be row/colF/Col ?
>>>>
>>>
>>> No locking, no.
>>>
>>>
>>>> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
>>>> Concurrent updates on exactly the same piece of data on different nodes can
>>>> still mess each other up, right ?
>>>>
>>>
>>> Not sure why you are taking CL.ALL specifically. But in any CL, updating
>>> the same piece of data means the same column value. In that case, the
>>> resolution rules are the following:
>>>    - If the updates have a different timestamp, keep the one with the
>>> higher timestamp. That is, the more recent of two updates win.
>>>   - If the timestamps are the same, then it compares the values (byte
>>> comparison) and keep the highest value. This is just to break ties in a
>>> consistent manner.
>>>
>>> So if you do two truly concurrent updates (that is from two place at the
>>> same instant), then you'll end with one of the update. This is the column
>>> level.
>>>
>>> However, if that simple conflict detection/resolution mechanism is not
>>> good enough for some of your use case and you need to keep two concurrent
>>> updates, it is easy enough. Just make sure that the update don't end up in
>>> the same column. This is easily achieved by appending some unique identifier
>>> to the column name for instance. And when reading, do a slice and reconcile
>>> whatever you get back with whatever logic make sense. If you do that,
>>> congrats, you've roughly emulated what vector clocks would do. Btw, no
>>> locking or anything needed.
>>>
>>> In my experience, for most things the timestamp resolution is enough. If
>>> the same user update twice it's profile picture on you web site at the same
>>> microsecond, it's usually fine to end up with one of the two pictures. In
>>> the rare case where you need something more specific, using the cassandra
>>> data model usually solves the problem easily. The reason for not having
>>> vector clocks in Cassandra is that so far, we haven't really found much
>>> example where it is no the case.
>>>
>>> --
>>> Sylvain
>>>
>>>
>>
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by Sylvain Lebresne <sy...@datastax.com>.

On Thu, Feb 24, 2011 at 3:22 AM, Anthony John <ch...@gmail.com> wrote:

> Apologies : For some reason my response on the original mail keeps bouncing
> back, thus this new one!
> > From the other hand, the same article says:
> > "For conditional writes to work, the condition must be evaluated at all
> update
> > sites before the write can be allowed to succeed."
> >
> > This means, that when doing such an update CL=ALL must be used
>
> Sorry, but I am confused by that entire thread!
>
> Questions:-
> 1. Does Cassandra implement any kind of data locking - at any granularity
> whether it be row/colF/Col ?
>

No locking, no.

> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
> Concurrent updates on exactly the same piece of data on different nodes can
> still mess each other up, right ?
>

Not sure why you are taking CL.ALL specifically. But at any CL, updating the
same piece of data means the same column value. In that case, the resolution
rules are the following:
  - If the updates have different timestamps, keep the one with the higher
timestamp. That is, the more recent of two updates wins.
  - If the timestamps are the same, then it compares the values (byte
comparison) and keeps the highest value. This is just to break ties in a
consistent manner.
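
As a rough illustration of the two rules above, here is a minimal Python
sketch. The Column type and the reconcile function are hypothetical names
chosen for this example, not Cassandra's actual classes; the only thing
taken from the rules is "higher timestamp wins, byte comparison breaks ties".

```python
from typing import NamedTuple

class Column(NamedTuple):
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client, e.g. microseconds

def reconcile(a: Column, b: Column) -> Column:
    """Pick the winner between two versions of the same column."""
    # Rule 1: different timestamps, the higher (more recent) one wins.
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Rule 2: same timestamp, compare the raw bytes and keep the highest
    # value so that every node breaks the tie the same way.
    return a if a.value >= b.value else b

older = Column(b"picture", b"cat.png", 1000)
newer = Column(b"picture", b"dog.png", 2000)
print(reconcile(older, newer).value)  # the version with timestamp 2000
```

Because the result does not depend on the order of the arguments, every
replica that sees both versions ends up keeping the same one.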

So if you do two truly concurrent updates (that is, from two places at the
same instant), then you'll end up with one of the updates. This is at the
column level.

However, if that simple conflict detection/resolution mechanism is not good
enough for some of your use cases and you need to keep two concurrent
updates, it is easy enough. Just make sure that the updates don't end up in
the same column. This is easily achieved by appending some unique identifier
to the column name, for instance. And when reading, do a slice and reconcile
whatever you get back with whatever logic makes sense. If you do that,
congrats, you've roughly emulated what vector clocks would do. Btw, no
locking or anything needed.
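
A small sketch of that pattern, in Python and without any real client
library: a plain dict stands in for one row, and write_update, read_slice
and reconcile_cart are hypothetical helpers invented for this example. The
merge rule (set union of shopping cart items) is just one possible
application-level choice.

```python
import uuid

def write_update(row, base_name, value, writer_id=None):
    """Store the update in its own column: '<base_name>:<unique suffix>'."""
    suffix = writer_id or uuid.uuid4().hex
    row[f"{base_name}:{suffix}"] = value

def read_slice(row, base_name):
    """Return every version stored under the base name (a 'slice')."""
    prefix = base_name + ":"
    return [row[name] for name in sorted(row) if name.startswith(prefix)]

def reconcile_cart(versions):
    """Application-specific merge: here, the union of all cart items."""
    merged = set()
    for items in versions:
        merged |= set(items)
    return merged

row = {}  # stand-in for one row: column name -> value
write_update(row, "cart", ["book"], writer_id="client-a")
write_update(row, "cart", ["pen"], writer_id="client-b")
print(reconcile_cart(read_slice(row, "cart")))
```

Both concurrent updates survive because they never share a column name; the
cost is that the reader has to do the merge.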

In my experience, for most things the timestamp resolution is enough. If the
same user updates their profile picture twice on your web site at the same
microsecond, it's usually fine to end up with one of the two pictures. In
the rare case where you need something more specific, using the Cassandra
data model usually solves the problem easily. The reason for not having
vector clocks in Cassandra is that so far, we haven't really found many
examples where that is not the case.

--
Sylvain

Re: New Chain for : Does Cassandra use vector clocks

Posted by Anthony John <ch...@gmail.com>.

My 2 cents ..

1. Focus should be on the core problem Cassandra is solving, i.e.
Availability, Partitioning and a form of consistency that works (in spite of
all the questions). All this with high performance is a huge step forward -
architecturally!
2. Any enhancement should shore up the core value proposition, and should not
detract from it. Specifically, packing every feature into the product might
create an easy to use kitchen sink, but also create a less nimble behemoth
(no product names here ;))
3. The beauty of open source is the ability to combine different ideas to
solve a problem - with each piece (layer) providing an  identified set of
guarantees implemented with the greatest efficiency possible.

Finally, it would be a mistake to try to drive Cassandra in the direction of
an ACID data store, watering down the core value proposition.

But I just talk!

-JA

On Thu, Feb 24, 2011 at 2:46 AM, tijoriwala.ritesh <
tijoriwala.ritesh@gmail.com> wrote:

>
> If it cannot protect against lost updates, isn't that an issue? How is
> client
> support to protect against concurrency? I see lot of users mentioning the
> use of cages (i.e. use ZooKeeper) but involving locks on every writes at
> the
> application level is certainly not acceptable. And again, the application
> will end up using vector clocks anyways. IMHO, this support should be built
> into cassandra especially when it provides all the knobs to the client to
> choose the right consistency level. So if client chooses R + W > N, then it
> should be possible for Cassandra to detect conflicts.
> --
> View this message in context:
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/New-Chain-for-Does-Cassandra-use-vector-clocks-tp6058892p6059594.html
> Sent from the cassandra-user@incubator.apache.org mailing list archive at
> Nabble.com.
>

Re: New Chain for : Does Cassandra use vector clocks

Posted by "tijoriwala.ritesh" <ti...@gmail.com>.

If it cannot protect against lost updates, isn't that an issue? How is the
client supposed to protect against concurrency? I see a lot of users mentioning
the use of cages (i.e. use ZooKeeper) but involving locks on every write at the
application level is certainly not acceptable. And again, the application
will end up using vector clocks anyway. IMHO, this support should be built
into Cassandra, especially when it provides all the knobs to the client to
choose the right consistency level. So if the client chooses R + W > N, then it
should be possible for Cassandra to detect conflicts.
-- 
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/New-Chain-for-Does-Cassandra-use-vector-clocks-tp6058892p6059594.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.

Re: New Chain for : Does Cassandra use vector clocks

Posted by Edward Capriolo <ed...@gmail.com>.

On Wed, Feb 23, 2011 at 9:28 PM, Ritesh Tijoriwala
<ti...@gmail.com> wrote:
> I was about to ask what Anthony's latest post below captures - if we don't
> have vector clocks and no locking, how does cassandra prevent/detect
> conflicts? This is somewhat related to the question I asked in last post
> - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-td6055152.html
> Thanks,
> Ritesh
>
>
>
> On Wed, Feb 23, 2011 at 6:22 PM, Anthony John <ch...@gmail.com> wrote:
>>
>> Apologies : For some reason my response on the original mail keeps
>> bouncing back, thus this new one!
>>
>> > From the other hand, the same article says:
>> > "For conditional writes to work, the condition must be evaluated at all
>> > update
>> > sites before the write can be allowed to succeed."
>> >
>> > This means, that when doing such an update CL=ALL must be used
>>
>> Sorry, but I am confused by that entire thread!
>> Questions:-
>> 1. Does Cassandra implement any kind of data locking - at any granularity
>> whether it be row/colF/Col ?
>> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
>> Concurrent updates on exactly the same piece of data on different nodes can
>> still mess each other up, right ?
>> -JA
>

Cassandra does not provide any built-in locking. It cannot protect
from "lost updates" caused by multiple independent entities reading
and writing the same data.

The cages library handles locking externally and is really easy to use.
http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/

Re: New Chain for : Does Cassandra use vector clocks

Posted by Ritesh Tijoriwala <ti...@gmail.com>.

I was about to ask what Anthony's latest post below captures - if we don't
have vector clocks and no locking, how does cassandra prevent/detect
conflicts? This is somewhat related to the question I asked in last post -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-td6055152.html

Thanks,
Ritesh

On Wed, Feb 23, 2011 at 6:22 PM, Anthony John <ch...@gmail.com> wrote:

> Apologies : For some reason my response on the original mail keeps bouncing
> back, thus this new one!
> > From the other hand, the same article says:
> > "For conditional writes to work, the condition must be evaluated at all
> update
> > sites before the write can be allowed to succeed."
> >
> > This means, that when doing such an update CL=ALL must be used
>
> Sorry, but I am confused by that entire thread!
>
> Questions:-
> 1. Does Cassandra implement any kind of data locking - at any granularity
> whether it be row/colF/Col ?
> 2. If the answer to 1 above is NO! - how does CL ALL prevent conflicts.
> Concurrent updates on exactly the same piece of data on different nodes can
> still mess each other up, right ?
>
> -JA
>