Posted to commits@cassandra.apache.org by "Yang Yang (JIRA)" <ji...@apache.org> on 2011/06/15 08:27:47 UTC

[jira] [Created] (CASSANDRA-2774) one way to make counter delete work better

one way to make counter delete work better
------------------------------------------

                 Key: CASSANDRA-2774
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
             Project: Cassandra
          Issue Type: New Feature
            Reporter: Yang Yang


The current counter does not work with delete, because different merge orders of sstables produce different results. For example:

add 1     (op 1)

delete    (op 2)

add 2     (op 3)

if the merging happens in the order 1--2, then (1,2)--3, the result we see will be 2;
if the merging is 1--3, then (1,3)--2, the result will be 3.



The issue is that a delete currently cannot separate the adds that came before it from the adds that came after it. Conceptually, a delete should create a completely new incarnation of the counter, a new "lifetime" or "epoch". The new approach uses an "epoch number": each delete bumps up the epoch number. Since each write is replicated (replicate-on-write is almost always enabled in practice; if this is a concern, we could further force ROW in the case of delete), the epoch number is global to a replica set.
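To illustrate the idea (with hypothetical names, not the patch's actual classes): if each counter state carries an epoch alongside its count, a delete opens a new epoch, and reconciliation lets the higher epoch win, then the merge result no longer depends on merge order:

```java
// Hypothetical sketch of the epoch idea; EpochCounter is an illustrative
// stand-in, not a class from the attached patch.
final class EpochCounter {
    final long epoch; // bumped by every delete
    final long count; // sum of increments within this epoch

    EpochCounter(long epoch, long count) {
        this.epoch = epoch;
        this.count = count;
    }

    // The newer epoch wins outright; within one epoch, increments commute,
    // so any merge order yields the same final state.
    static EpochCounter merge(EpochCounter a, EpochCounter b) {
        if (a.epoch != b.epoch)
            return a.epoch > b.epoch ? a : b;
        return new EpochCounter(a.epoch, a.count + b.count);
    }
}
```

With the add 1 / delete / add 2 example above, "add 1" lives in epoch 0, the delete opens epoch 1, and "add 2" lands in epoch 1; either merge order now yields 2.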




Changes are attached; existing tests pass fine, and some tests are modified since the semantics change a bit. Some CQL tests do not pass even in the original 0.8.0 source; that is not the fault of this change.

see details at http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTikQcgLSNwtT-9HvqpSeoo7SF58SnA@mail.gmail.com%3E


The goal of this is to make delete work, at least with consistent behavior (yes, in the case of a long network partition the behavior is not ideal, but it is consistent with the definition of a logical clock), so that we could have expiring counters.



--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049805#comment-13049805 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/15/11 3:49 PM:
---------------------------------------------------------------

It's good to step back and ask "what is the contract?" now.
So far Cassandra lets time be dictated arbitrarily by the client
(the client assigns the timestamp), so it can afford to promise *the*
ordering and *the* result.

For counters, client-generated timestamps can't be used, so we can't
guarantee *the* expected result.

But we will guarantee that all nodes in the system eventually agree on
*some* common result; this is actually consistent with the definition
of eventual consistency brought forth by Dynamo.


(If you look at client-generated timestamps, they can indeed produce some
unintuitive results: if a client sets a timestamp far into
the future, later writes can be wiped out. For regular columns
this goes against common client expectations about ordering in the same way as in the counter case.)

In summary, the current implementation cannot guarantee any
common agreement on a counter's value, but the new approach guarantees that *some* common value is reached.




[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049650#comment-13049650 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/15/11 6:42 AM:
---------------------------------------------------------------

Note that the main logic is rather simple and concentrated in CounterColumn.reconcile().

But most of the coding was done around setting up a "lastDeleteTimestamp()" for a completely new incoming CounterUpdateColumn, since it does not have any history yet. The code in this part makes some rather messy changes and should definitely take a better route, but so far it is only a demonstration of the idea; people more familiar with the code path can suggest a better way.


Here is how it works:
When a CounterUpdateColumn comes in, we put it in the memtable, where it goes through the reconcile process. The new code obtains the state of the columns listed in the mutation AFTER the mutation is applied. We check whether any columns are completely new, i.e. have no matching column in the memtable, so that their timestampOfLastDelete() is still "UNDECIDED". For these columns we do a read, find any of their existing columns in the sstables, and assign the correct timestampOfLastDelete() to them. We do not read the sstables up front because most counter adds will already have a matching column in the memtable, so we only incur the extra sstable read cost when we start a new counter in the memtable.


Reconcile rules:
   1)  newer epoch (timestampOfLastDelete()) wins
   2)  UNDECIDED inherits the timestampOfLastDelete() of the competitor (UNDECIDED can be seen ONLY when a new column comes into the memtable)
   3)  if timestampOfLastDelete() is the same, non-delete wins over delete
   4)  finally, use standard merging between CounterColumns
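As a rough sketch of these four rules (illustrative names and a flat count; the real patch works on CounterColumn and merges per-node shards):

```java
// Illustrative sketch of the reconcile rules described above; names and the
// flat `count` are simplifications, not the patch's actual CounterColumn code.
final class CCol {
    static final long UNDECIDED = Long.MIN_VALUE;
    final long lastDelete; // the epoch, i.e. timestampOfLastDelete()
    final long count;
    final boolean isDelete;

    CCol(long lastDelete, long count, boolean isDelete) {
        this.lastDelete = lastDelete;
        this.count = count;
        this.isDelete = isDelete;
    }

    static CCol reconcile(CCol a, CCol b) {
        // rule 2: an UNDECIDED newcomer inherits the competitor's epoch
        if (a.lastDelete == UNDECIDED)
            a = new CCol(b.lastDelete, a.count, a.isDelete);
        if (b.lastDelete == UNDECIDED)
            b = new CCol(a.lastDelete, b.count, b.isDelete);
        // rule 1: the newer epoch wins
        if (a.lastDelete != b.lastDelete)
            return a.lastDelete > b.lastDelete ? a : b;
        // rule 3: within the same epoch, a live column beats a tombstone
        if (a.isDelete != b.isDelete)
            return a.isDelete ? b : a;
        // rule 4: standard counter merging (simplified here to a sum)
        return new CCol(a.lastDelete, a.count + b.count, a.isDelete);
    }
}
```

Rule 2 only matters at memtable insertion time; once a column has inherited an epoch, rules 1, 3 and 4 decide everything.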






[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052007#comment-13052007 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/20/11 3:16 PM:
---------------------------------------------------------------

Thanks a lot Sylvain for looking through this.


Two answers to your last comment:

1) "__it is impossible for other nodes to have a version for A's shard that is greater than what A has__" -- actually it is possible. D can have a higher version (clock) for A's shard because A's A-shard clock was decremented some time before, wiped out by the delete; D *did not increment* the clock of the A-shard. In other words, the result "D having a higher A-shard clock than A" can be achieved either by "D increments the A-shard clock" (which is impossible in the current implementation) *or* by "A decrements its A-shard clock", which is possible through the deletion.

2) Regarding your example:
if the 3 operations are in 3 sstables, which are merged in the following order:
+1
+1
Delete

the second +1 will already have an epoch of 1 (which you pointed out, through inheritance), while the first one has an epoch of 0.
Under the new reconcile() rules (lines 172--178 in the patch), because the second +1 has a different, higher epoch (timestampOfLastDelete) than
the first one, the first +1 is thrown away. So in this example the final result will always be +1.

The central point of the change is that we never look at timestamp() again, only at timestampOfLastDelete (timestamp() is only used
to assign timestampOfLastDelete for delete operations).


Thanks
Yang


[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050016#comment-13050016 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

---"Once compaction has compacted the deletes, all nodes will reach common agreement."

We probably need to look at this more closely. I was going to use the example of a node constantly changing its answer due to the different merge order on each read,
but since you pointed out that we should allow all compaction to finish before making the judgment, let's look at another ill-case:

The way I contrive the following case is very similar to the often-mentioned compaction ill-case, with that effect moved onto network messages, because
changing the order of merging sstables in compaction is very similar to changing the order of message deliveries.

let's say we have 4 nodes, A B C D. all the traffic we observe is, with increasing timestamp():

A leader  add 1  ts=100
B leader delete  ts=200
C leader  add 2  ts=300

now the updates so far start to replicate to D: assume that D sees the following order: A.(add 1), C.(add 2), B.(delete), after these, D's state is:
[A:1 C:2, last_delete=200, timestamp=300]  

now let all the traffic among A, B, C go through, and let them fully resolve (exchanging pair-wise messages etc.), so A, B, C all come to the state: [A:nil  C:2,  last_delete=200  timestamp=300]

now A's state and D's state are different. Say we let A repair D: A's A-shard has a lower clock, so D wins. If we let D repair A: A's A-shard is isDelta(), so it trumps D. As a result we never reach agreement between A and D, even though traffic is allowed to flow freely.

I just started looking inside the CounterContext logic, so I could very well be wrong. Thanks for your time looking through this.



As to performance, I don't think the increase will be significant: 1) most application use cases increment the same counter many times while it is in the memtable; it's hard to imagine that most counters will be incremented only once before being dumped out to an sstable. Only the first increment in the memtable for each counter suffers a disk read; if on average each counter is incremented 1000 times before being flushed, the disk read cost is amortized over 1000 increments. 2) Even when we do the disk read, any realistic counter setup already needs ROW and CL > ONE, so a disk read is needed anyway before the client is acked. Here we do an extra disk read, but when we do the regular disk read for CounterMutation.makeReplicationMutation(), the file blocks have already been brought into cache by the new extra read, so total disk access time remains roughly the same.



[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049654#comment-13049654 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

the patch is against standard 0.8.0-src source


[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049834#comment-13049834 ] 

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------


bq. I think with quorum delete you will guarantee timing to be consistent with the client, and then achieve the client-expected result in your case; I'd like to hear your counter-example

Consider a cluster with RF=3 and a counter c replicated on nodes A, B and C. Consider that all operations are done by the same client connected to some other node (it doesn't have to be the same one each time, but it can be). All operations are performed at QUORUM consistency level.

The client does the following operations:
# increment c by 1
# delete c
# increment c by 1
# reads c

Because QUORUM is 2, depending on internal timings (latency on the wire and such), either only 2 or all 3 nodes will have seen each write once it is acked to the client. Again, for the same inputs and depending on timing, the client could get a variety of results on the read:
* 1, if each node has received each operation in the order issued.
* 0 or 2 if, for instance, by the time the read is issued:
** the first increment only reached B and C,
** the deletion only reached A and C,
** the second increment only reached A and B, and it happens that the first two nodes answering the read are B and C. The exact value depends on the exact rules for dealing with the epoch number, but in any case B would only have the two increments and C would have the first increment and the deletion (issued after the increment, so the deletion wins). So B will answer 2 and C will answer a tombstone. Whatever resolution the coordinator does, it just cannot return 1 that time.
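A minimal sketch of that timing (a hypothetical Replica stand-in; assumes a tombstone reads as 0): B holds only the two increments, C holds the first increment followed by the deletion, so a quorum read answered by B and C cannot reconstruct 1:

```java
// Illustrative replica model for the scenario above; not Cassandra code.
final class Replica {
    long count = 0;

    void increment(long v) { count += v; }
    void delete() { count = 0; }   // the deletion wipes what this replica saw
    long read() { return count; }  // a tombstone simply reads as 0 here
}
```

B answers 2 and C answers 0; whatever the coordinator does with those two answers, 1 is not among the values it can return.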



[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049874#comment-13049874 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

You are right that "it just cannot return 1 *at that time*"; 0 or 2 is a not-yet-stable value that the system held at some past snapshot in time.

but it will eventually come to answer 1:

since our edge case above assumes that B has not got the deletion yet, the leader of the second increment cannot be A, because otherwise B would have got the deletion from A (on A the increment comes after the deletion). So B was the leader of the second increment.


For C, it now has the new epoch. Say A's second increment reaches C (through repair, since A is not the leader of the second increment): this increment carries the new epoch, so C accepts it. If B's second increment reaches C, it belongs to the old epoch and is rejected.

For B, it is still on the old epoch; after the second increment, B's count is 2 in the old epoch. But when A's increment goes to B through repair, or is reconciled with B's in a read, the result will be 1. If C's deletion goes to B, B is brought more up to date, to a value of 0 in the new epoch.

The above analysis does not go through all possible scenarios; to give a definitive proof of the conjecture that "all nodes return *the* ordering given by the client, in the case of quorum read/write", I need to think more.

But as I stated in my last comment, at least we can be sure that the new approach guarantees *some* common agreement eventually. It would be nice to achieve *the* agreement in the quorum case, but that is not my main argument.


[jira] [Updated] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Yang updated CASSANDRA-2774:
---------------------------------

    Attachment: counter_delete.diff

Note that the main logic is rather simple and concentrated in CounterColumn.reconcile().

But most of the coding was done around setting up a "lastDeleteTimestamp()" for a completely new incoming CounterUpdateColumn, since it does not have any history yet. The code in this part makes some rather messy changes and should definitely take a better route, but so far it is only a demonstration of the idea; people more familiar with the code path can suggest a better way.



> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> current Counters do not work with delete, because different merging orders of sstables produce different results; for example:
> add 1
> delete
> add 2
> if the merging happens in the order 1-2, then (1,2)-3, the result we see will be 2;
> if the merging is 1-3, then (1,3)-2, the result will be 3.
> the issue is that a delete currently cannot separate adds made before the delete from adds made after it. conceptually, a delete creates a completely new incarnation of the counter, a new "lifetime" or "epoch". the new approach uses an "epoch number": each delete bumps up the epoch. since every write is replicated (replicate-on-write is almost always enabled in practice; if this is a concern, we could force ROW for deletes), the epoch number is global to a replica set
> changes are attached and existing tests pass; some tests were modified since the semantics changed slightly. some cql tests do not pass even in the original 0.8.0 source, which is not the fault of this change.
> see details at http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTikQcgLSNwtT-9HvqpSeoo7SF58SnA@mail.gmail.com%3E
> the goal of this is to make delete work with at least consistent behavior (yes, in the case of a long network partition the behavior is not ideal, but it is consistent with the definition of a logical clock), so that we could have expiring Counters


[jira] [Updated] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2774:
--------------------------------------

       Reviewer: slebresne
    Component/s: Core
       Priority: Minor  (was: Major)
       Assignee: Yang Yang


[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050016#comment-13050016 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/16/11 1:40 PM:
---------------------------------------------------------------

---"Once compaction has compacted the deletes, all nodes will reach common agreement."

we probably need to look at this more closely. I was going to use the example of a node that keeps changing its reported value because of a different merging order on each read, but since you pointed out that we should let all compaction finish before making the judgement, let's look at another ill-behaved case:

the following case is contrived much like the often-mentioned compaction ill-case, with the effect moved onto network messages, because changing the order in which compaction merges sstables is very similar to changing the order of message deliveries.

let's say we have 4 nodes, A B C D. all the traffic we observe, with increasing timestamp(), is:

A leader  add 1  ts=100
B leader  delete  ts=200
C leader  add 2  ts=300

now the updates so far start to replicate to D. assume D sees them in the order A.(add 1), C.(add 2), B.(delete); after these, D's state is:
[A:1 C:2, last_delete=200, timestamp=300]

now let all the traffic between A, B, C go through, and let them fully resolve (exchanging pair-wise messages etc.), so A, B, C all arrive at the state: [A:nil C:2, last_delete=200, timestamp=300]

now A's state and D's state differ. if we let A repair D, A's A-shard has a lower clock, so D wins; if we let D repair A, A's A-shard is isDelta(), so it trumps D. as a result it seems we *never reach agreement between A and D*, even though traffic is allowed to flow freely.

I just started looking inside the CounterContext logic, so I could very well be wrong. Thanks for your time looking through this.



as to performance, I don't think it will be a significant increase: 1) most application use cases increment the same counter many times while it is in the memtable; it's hard to imagine that most counters are incremented only once before being flushed to an sstable. only the first increment in the memtable for each counter suffers a read from disk, so if on average each counter is incremented 1000 times before being flushed, the disk read cost is amortized over 1000 increments; 2) even if we do the disk read, any realistic counter setup already needs ROW and CL > ONE, so a disk read is needed anyway before the client is acked. here we do an extra disk read, but when we do the regular disk read for CounterMutation.makeReplicationMutation(), the file blocks have already been brought into cache by the new extra read, which saves time there, so the total disk access time remains roughly the same.
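The amortization claim in point 1 is simple arithmetic; a sketch with assumed figures (the numbers below are illustrative, not Cassandra measurements):

```python
# Back-of-envelope model of the read-before-write cost. Both figures are
# assumptions chosen for illustration.
disk_read_ms = 10.0           # one random read to fetch prior counter state
increments_per_flush = 1000   # increments a hot counter receives per memtable

# Only the first increment after a flush pays the disk read; the remaining
# increments hit the memtable directly.
amortized_ms = disk_read_ms / increments_per_flush
print(amortized_ms)  # 0.01 ms of extra read cost per increment
```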



[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049651#comment-13049651 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

note that the original idea was to use an "epoch" number, a pure traditional logical clock, but I found that we already have the timestampOfLastDelete() field,
which has exactly the same effect, so I just reused timestampOfLastDelete() as the epoch number.
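The epoch idea can be sketched as follows (a hypothetical model, not the patch's actual code): each fragment carries (value, epoch), a delete is a zero-valued fragment at a bumped epoch, and reconcile discards anything from an older epoch. This assumes the add that follows the delete has already inherited the delete's epoch via the read-before-write.

```python
# Minimal sketch of epoch-based reconcile (names and representation are
# illustrative, not from the actual patch).

def reconcile(a, b):
    """a, b are (value, epoch); a delete is modeled as (0, bumped_epoch)."""
    if a[1] != b[1]:
        return a if a[1] > b[1] else b   # the newer epoch wins outright
    return (a[0] + b[0], a[1])           # same epoch: values sum

# add 1 in epoch 0; delete bumps the epoch to 1; add 2 lands in epoch 1
add1, delete, add2 = (1, 0), (0, 1), (2, 1)

in_order   = reconcile(reconcile(add1, delete), add2)
adds_first = reconcile(reconcile(add1, add2), delete)
print(in_order, adds_first)  # (2, 1) (2, 1): same answer in both orders
```

Under this rule the add/delete/add example converges to the same value regardless of merge order, which is the order-independence the epoch is meant to buy.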



[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049805#comment-13049805 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

It's good to step back and ask: "what is the contract?"

So far Cassandra lets time be dictated arbitrarily by the client (the client assigns timestamps), so it can afford to promise *the* ordering and *the* result.

With counters, client-generated timestamps can't be used, so we can't guarantee *the* expected result.

But we will guarantee that all nodes in the system eventually agree on *some* common result; this is actually consistent with the definition of eventual consistency brought forth by Dynamo.

(If you look at client-generated timestamps, they can indeed produce unintuitive results: if a client sets a timestamp far into the future, later writes can be wiped out. For regular columns this goes against common client expectations about ordering in the same way as in the counter case.)

In summary, the current implementation cannot guarantee any common agreement on the counter value, but the new approach guarantees eventual agreement on *some* common value.


[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052007#comment-13052007 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/20/11 3:42 PM:
---------------------------------------------------------------

Thanks a lot Sylvain for looking through this.

2 answers to your last comment:

1) "__it is impossible for other nodes to have a version for A's shard that is greater than what A has__": actually it is possible. D can have a higher version (clock) for the A-shard because A's own A-shard clock was effectively decremented some time before, wiped out by the delete; D *did not increment* the clock of the A-shard. in other words, the result "D having a higher A-shard clock than A" can be reached either by "D increments the A-shard clock" (impossible in the current implementation) *or* by "A decrements its A-shard clock", which is possible through the deletion.

2) regarding your example:
if the 3 operations are in 3 sstables, which are merged in the following order:
+1
+1
Delete

the second +1 will already have an epoch of 1 (which you pointed out, through inheritance), while the first one has an epoch of 0. under the new reconcile() rules (lines 172-178 in the patch), because the second +1 has a different, higher epoch (timestampOfLastDelete) than the first one, the first +1 is thrown away. so in this example, the final result will always be +1. this is also shown in the test case at lines 565-578 in the patch.

the central point of the change is that we never look at timestamp() again, only at timestampOfLastDelete (timestamp() is used only to assign timestampOfLastDelete for delete operations).


Thanks
Yang


[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13051829#comment-13051829 ] 

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

{quote}
let's say we have 4 nodes, A B C D. all the traffic we observe, with increasing timestamp(), is:

A leader add 1 ts=100
B leader delete ts=200
C leader add 2 ts=300

now the updates so far start to replicate to D. assume D sees them in the order A.(add 1), C.(add 2), B.(delete); after these, D's state is:
[A:1 C:2, last_delete=200, timestamp=300]

now let all the traffic between A, B, C go through, and let them fully resolve (exchanging pair-wise messages etc.), so A, B, C all arrive at the state: [A:nil C:2, last_delete=200, timestamp=300]

now A's state and D's state differ. if we let A repair D, A's A-shard has a lower clock, so D wins; if we let D repair A, A's A-shard is isDelta(), so it trumps D. as a result it seems we never reach agreement between A and D, even though traffic is allowed to flow freely.
{quote}

This is *not* how the counter implementation works. In the implementation, only A is ever able to increment its own clock. As a consequence, it is impossible for other nodes to have a version of A's shard that is greater than what A has. That "scenario" is not a valid scenario.

Now, looking at your patch a bit more closely, I actually fail to see how it changes anything. You do impose a read during the write, but it only changes the CounterColumn.timestampOfLastDelete field, so the code is still dependent on the merge order of sstables. To be more concrete: even on a single-node cluster, suppose you do 3 writes (received in that exact order, since we're considering only one node): +1, then delete, then +1. The first +1 will have an initial "epoch", say 0; the delete will have a bigger epoch, say 1; and the second +1 will inherit that epoch 1. But nothing forces all those updates to be in the same sstable, and if they are in different sstables and the two increments are merged first, the result is a +2 with epoch 1 that, when merged with the tombstone, will simply discard it (per the rules of your patch), and we will finally return +2 to the client. But if the merge order is different and we first merge the first increment with the tombstone and then with the second increment, the final result will be +1. Exactly as the code already does.
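The two readings of the patch's rules can be made concrete with a toy model (all names here are hypothetical, and neither branch is the actual patch code). A fragment is a value, an epoch, and a tombstone flag; the readings differ only in how increments from different epochs combine:

```python
# Sylvain's single-node example: +1 (epoch 0), delete (epoch 1),
# +1 (epoch 1, inherited). 'discard_old_epoch' toggles between the two
# readings discussed above; this is an illustrative model only.

def merge(a, b, discard_old_epoch):
    if a['e'] == b['e']:
        return {'v': a['v'] + b['v'], 'e': a['e'],
                'tomb': a['tomb'] and b['tomb']}
    hi, lo = (a, b) if a['e'] > b['e'] else (b, a)
    if discard_old_epoch or hi['tomb']:
        return hi  # the older epoch's contribution is wiped out
    # other reading: two increments still sum across epochs
    return {'v': hi['v'] + lo['v'], 'e': hi['e'], 'tomb': False}

inc1 = {'v': 1, 'e': 0, 'tomb': False}
tomb = {'v': 0, 'e': 1, 'tomb': True}
inc2 = {'v': 1, 'e': 1, 'tomb': False}

for discard in (False, True):
    incs_first = merge(merge(inc1, inc2, discard), tomb, discard)
    in_order   = merge(merge(inc1, tomb, discard), inc2, discard)
    print(discard, incs_first['v'], in_order['v'])
# False: 2 vs 1 (summing increments across epochs stays order-dependent)
# True:  1 vs 1 (discarding the older epoch makes this example converge)
```

Which of the two readings the patch's reconcile() rules actually implement is exactly the point under discussion here.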



[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049771#comment-13049771 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/15/11 3:51 PM:
---------------------------------------------------------------

--- "the only valid answer....": not really.

In fact this is the same reasoning I explained for your network-partition case on the mailing list. You expect that to be "the only valid" case *because* of an assumed timing order from the client's perspective, but we know that timing is not trustworthy, and that client-side order is not respected because the client does not participate in the logical-clock protocol of epoch generation.

Btw, we'd better force ROW for deletes so that we achieve faster agreement and lose fewer adds on the old epoch.

I think with a quorum delete you will guarantee timing consistent with the client, and then achieve the client-expected result in your case; I'd like to hear your counter-example.

Pardon the typing; I'm on a phone, will be on a computer soon.


[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049912#comment-13049912 ] 

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

bq. but as I stated in my last comment, at least we can be sure that the new approach guarantees some common agreement eventually. 

It is already the case with the current implementation. Once compaction has compacted the deletes, all nodes will reach common agreement.

bq. it would be nice if we achieve the agreement in case of quorum, but that's not my main argument

My main argument is that this patch slightly changes the behavior here and there, but I don't think it adds any tangible new guarantee that people can work with. On the other hand, it adds a fairly heavy performance hit by imposing a read before write on every replica (and though you won't necessarily do a read for every write, you will do that read more often than not as soon as the set of counters you're incrementing is not small enough).

> one way to make counter delete work better
> ------------------------------------------
>
>                 Key: CASSANDRA-2774
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2774
>             Project: Cassandra
>          Issue Type: New Feature
>    Affects Versions: 0.8.0
>            Reporter: Yang Yang
>         Attachments: counter_delete.diff
>
>
> Current Counter does not work with delete, because different merge orders of sstables produce different results. For example:
> add 1
> delete 
> add 2
> If the merging happens in the order 1-2, (1,2)-3, the result we see will be 2;
> if the merging is 1-3, (1,3)-2, the result will be 3.
> The issue is that a delete currently cannot separate out adds before it from adds after it. A delete is supposed to create a completely new incarnation of the counter, a new "lifetime" or "epoch". The new approach uses an explicit "epoch number", so that each delete bumps up the epoch number. Since each write is replicated (replicate-on-write is almost always enabled in practice; if this is a concern, we could further force ROW for deletes), the epoch number is global to a replica set.
> Changes are attached; existing tests pass, and some tests were modified since the semantics changed slightly. Some CQL tests do not pass in the original 0.8.0 source; that is not caused by this change.
> See details at http://mail-archives.apache.org/mod_mbox/cassandra-user/201106.mbox/%3CBANLkTikQcgLSNwtT-9HvqpSeoo7SF58SnA@mail.gmail.com%3E
> The goal of this is to make delete work (at least with consistent behavior; yes, in case of a long network partition the behavior is not ideal, but it is consistent with the definition of a logical clock), so that we could have expiring Counters.
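The order-dependent merge described in the quoted example can be reproduced with a toy model (illustrative only; the tuple encoding and `merge()` below are a sketch, not Cassandra's actual CounterContext): an add is a `(timestamp, delta)` shard, a delete is a `("del", timestamp)` tombstone, and merging two adds keeps the max timestamp, which lets an old add hide behind a newer one and survive a delete it predates.

```python
# A toy model of the order-dependent merge described above (the tuple
# encoding and merge() are illustrative, not Cassandra's CounterContext).

def merge(a, b):
    if a[0] == "del" and b[0] == "del":
        return ("del", max(a[1], b[1]))
    if a[0] == "del":
        a, b = b, a                         # normalize: add first
    if b[0] == "del":                       # add vs tombstone: later ts wins
        return a if a[0] > b[1] else b
    return (max(a[0], b[0]), a[1] + b[1])   # add vs add: sum, keep max ts

add1, delete, add2 = (1, 1), ("del", 2), (3, 2)  # timestamps 1, 2, 3

r1 = merge(merge(add1, delete), add2)  # merge order 1-2, then (1,2)-3
r2 = merge(merge(add1, add2), delete)  # merge order 1-3, then (1,3)-2
print(r1, r2)  # (3, 2) vs (3, 3): same operations, different final counts
```

The same three operations yield a count of 2 or 3 depending purely on merge order, matching the divergence in the description.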


[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049649#comment-13049649 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

See the original JIRA: https://issues.apache.org/jira/browse/CASSANDRA-2101




[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049771#comment-13049771 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

--- "the only valid answer...." not really

In fact this is the same reasoning that I explained in your
Net partition case on gmail.g list
You expect that it is"  the only valid"  case* * 
Because* ** of assumed timing order from client respective
but we know that timing
Is not trustworthy and this order on client is
Not respected because client does not participate
In the logical clock protocol of epoch generation

Btw we'd better force ROW for delete so we achieve
Faster agreement and lose fewer adds
On old epoch. 

I think with quorum delete you will guarantee
timing to be consistent eoyh client
And then achieve client expected result I. Your
Case, id like to hear your counter example


Pardon the typing, on phone, will be
On computer soon





[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13052007#comment-13052007 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/20/11 3:15 PM:
---------------------------------------------------------------

Thanks a lot, Sylvain, for looking through this.

Two answers to your last comment:

1) "__it is impossible for other nodes to have a version for A's shard that is greater than what A has__": actually, it is possible. D is able to have a higher version (clock) for the A-shard because A's A-shard clock was decremented some time before, due to being wiped out by the delete. D *did not increment* the clock of the A-shard. In other words, the result "D having a higher A-shard clock than A" can be achieved either by "D increments the A-shard clock" (which is impossible) *or* by "A decrements its A-shard clock", which is possible through the deletion.

2) Regarding your example:
if the 3 operations are in 3 sstables, which are merged in the following order:
+1
+1
Delete

the second +1 will already have an epoch of 1 (which, as you pointed out, it gets through inheritance), while the first one has an epoch of 0. Under the new reconcile() rules (lines 172-178 in the patch), because the second +1 has a different, higher epoch (timestampOfLastDelete) than the first one, the first +1 is thrown away. So in this example, the final result will always be +1.

The central point of the change is that we never look at timestamp() again, only at timestampOfLastDelete (timestamp() is only used to assign timestampOfLastDelete for delete operations).
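A minimal sketch of this reconcile() rule (a hypothetical model that only mimics the epoch comparison described above; it is not the patch code): whatever order the three sstables are merged in, the epoch-0 add is discarded and the result is always +1.

```python
# Hypothetical model of the reconcile() rule described above: operations
# carry an epoch (timestampOfLastDelete); a higher epoch throws the lower
# one away entirely, and only same-epoch deltas sum.
from itertools import permutations

# (epoch, delta): the second +1 inherited epoch 1 from the delete,
# while the first +1 still carries epoch 0.
first_add, second_add, delete = (0, 1), (1, 1), (1, 0)

def reconcile(a, b):
    # Never compare timestamp(); only the epoch (timestampOfLastDelete).
    if a[0] != b[0]:
        return max(a, b)        # different epochs: higher epoch wins outright
    return (a[0], a[1] + b[1])  # same epoch: deltas sum

for order in permutations([first_add, second_add, delete]):
    acc = order[0]
    for op in order[1:]:
        acc = reconcile(acc, op)
    assert acc == (1, 1)        # every merge order yields +1

print("every merge order yields", acc)
```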


Thanks
Yang



[jira] [Issue Comment Edited] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049805#comment-13049805 ] 

Yang Yang edited comment on CASSANDRA-2774 at 6/15/11 3:48 PM:
---------------------------------------------------------------

It's good to step back and ask "what is the contract?" now.
So far Cassandra lets time be dictated arbitrarily by the client (the client assigns timestamps), so it can afford to promise *the* ordering and *the* result.

For counters, client-generated timestamps can't be used, so we can't guarantee *the* expected result.

But we will guarantee that all nodes in the system eventually agree on *some* common result; this is actually consistent with the definition of eventual consistency brought forth by Dynamo.

(If you look at client-generated timestamps, they can indeed produce some unintuitive results: if a client sets a timestamp far into the future, later writes could be wiped out. For regular columns this goes against common client expectations about ordering in the same way as in the counter case.)

In summary, the current implementation cannot guarantee any common agreement on a counter value, but the new approach guarantees the achievement of *some* common value.






[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Yang Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050025#comment-13050025 ] 

Yang Yang commented on CASSANDRA-2774:
--------------------------------------

One further thing: even if the current implementation reaches a common agreement, it can be a very arbitrary state. Yes, everyone uses the same rules, so once all messages are played out there has to be the same final winner; but the winner wins by very *arbitrary* rules, and you can't reason about the final state. In the new implementation, the end result is a "serializable" state, meaning that there exists a sequence of a subset of the incoming messages such that, if they were played out serially on one node, we would achieve the final result. The final result can then be explained to users as being due to such a sequence of messages, after message loss and re-ordering. I don't think the current implementation has this property.
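The serializability claim above can be illustrated with a toy epoch model (a sketch, not the patch code): the merged result of all messages equals a serial replay of some subsequence of them.

```python
# Illustrative sketch of the "serializable state" claim: with epochs, every
# merge order of [+1, delete, +2] converges on one state, and that state
# equals the serial replay of the subsequence [delete, +2].
from functools import reduce
from itertools import permutations

def reconcile(a, b):
    # a, b are (epoch, total) shards; a delete bumps the epoch.
    if a[0] != b[0]:
        return max(a, b)            # higher epoch discards the lower one
    return (a[0], a[1] + b[1])      # same epoch: totals add commutatively

ops = [(0, 1), (1, 0), (1, 2)]      # +1 (epoch 0); delete; +2 (epoch 1)

# Every merge order converges on the same state ...
results = {reduce(reconcile, p) for p in permutations(ops)}
assert results == {(1, 2)}

# ... and that state is exactly the serial replay of [delete, +2]:
assert reduce(reconcile, [(1, 0), (1, 2)]) == (1, 2)
print("merged state", results, "= serial replay of [delete, +2]")
```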





[jira] [Commented] (CASSANDRA-2774) one way to make counter delete work better

Posted by "Sylvain Lebresne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049688#comment-13049688 ] 

Sylvain Lebresne commented on CASSANDRA-2774:
---------------------------------------------

Consider 3 nodes A, B and C with RF=2 and a given counter c whose replica set is {B, C}.
Consider a single client issuing the following operations (in order) while connected to node A:
# client increments c by +2 at CL.ONE
# client deletes c at CL.ONE
# client increments c by +3 at CL.ONE
# client reads c at CL.ALL

The *only* valid answer the client should ever get on its last read is 3.  Any other value is a break of the consistency-level contract and not something we can expect people to be happy with. Any other answer means that deletes are broken (and this *is* the problem with the actual implementation).

However, because the writes are made at CL.ONE in the example above, at the time the read is issued the only thing we know for sure is that each write has been received by one node, but not necessarily the same one each time.  Depending on the actual timing and on which node happens to be the one acknowledging each write, when the read reaches the nodes you can have a lot of different situations, including:
* B and C have both received the 3 writes in the right order; they will both return 3, the 'right' answer.
* B received the deletion (the two increments are still on the wire, yet to be received) and C received the two increments (the delete is still on the wire, yet to be received). B will return the tombstone, C will return 5. You can assign all the epoch numbers you want; there is no way you can return 3 to the client. It will be either 5 or 0.

So the same query will result in different answers depending on the internal timing of events, and will sometimes return an answer that breaks the contract. Removal of counters is broken, and the only safe way to use it is for permanent removal with no following inserts. This patch doesn't fix that.

Btw, it's not too hard to come up with the same kind of example using only QUORUM reads and writes (but you'll need one more replica and a few more steps).
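The two situations in the bullets above can be sketched with a tiny model (hypothetical; `apply_write` below only mimics epoch-counter replicas, it is not Cassandra code): each replica applies whichever CL.ONE writes have reached it, and the CL.ALL read then faces irreconcilable answers.

```python
# Sketch of the CL.ONE race described above (a hypothetical model, not
# Cassandra code): a counter is modeled as (epoch, total), and a delete
# bumps the epoch; each replica applies the writes that reached it.

def apply_write(state, op):
    epoch, total = state
    if op == "delete":
        return (epoch + 1, 0)
    return (epoch, total + op)  # op is an increment delta

writes = [2, "delete", 3]  # +2, delete, +3, each acked at CL.ONE

# Case 1: both replicas received all three writes in order -> both say 3.
b = c = (0, 0)
for w in writes:
    b, c = apply_write(b, w), apply_write(c, w)
assert b == c == (1, 3)

# Case 2: B received only the delete; C received only the two increments.
b, c = apply_write((0, 0), "delete"), (0, 0)
for w in (2, 3):
    c = apply_write(c, w)
# B holds epoch 1 / tombstone, C holds epoch 0 / total 5; no epoch
# assignment lets the CL.ALL read reconstruct the expected answer of 3.
print("B:", b, "C:", c)  # B: (1, 0) C: (0, 5)
```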


