Posted to user@cassandra.apache.org by Yang <te...@gmail.com> on 2011/06/13 17:54:07 UTC

one way to make counter delete work better

As https://issues.apache.org/jira/browse/CASSANDRA-2101
indicates, the problem with counter deletes shows up in scenarios like the
following:

add 1, clock 100
delete , clock 200
add  2 , clock 300

If the 1st and 3rd operations are merged during SSTable compaction, then we
have

delete  clock 200
add 3,  clock 300

which shows the wrong result.


I think a relatively simple extension can completely fix this issue:
similar to ZooKeeper, we can prefix the clock with an "epoch" number, so that
   1) a delete operation increases the future epoch number by 1
   2) delta adds can only be merged with deltas of the same epoch; deltas
from an older epoch are simply ignored during merging, and the merged result
keeps the newest epoch seen.

Other operations remain the same as today. Note that the above two rules
only concern merging among the deltas on the leader; they do not affect the
replicated count, which is a simple final state and follows the "larger
clock trumps" rule. The natural ordering rule is: epoch1.clock1 >
epoch2.clock2  iff  epoch1 > epoch2 || (epoch1 == epoch2 && clock1 > clock2).

intuitively "epoch" can be seen as the serial number on a new "incarnation"
of a counter.
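
As a concrete illustration, here is a minimal Java sketch of that ordering
rule (assumed names, not the actual Cassandra types): compare the epoch
first, and use the clock only as a tie-breaker.

    // Hypothetical sketch of the proposed (epoch, clock) ordering.
    final class EpochClock implements Comparable<EpochClock> {
        final long epoch;   // bumped by one on every delete (the "incarnation" number)
        final long clock;   // the existing per-update clock, meaningful within an epoch

        EpochClock(long epoch, long clock) {
            this.epoch = epoch;
            this.clock = clock;
        }

        @Override
        public int compareTo(EpochClock other) {
            if (epoch != other.epoch)
                return Long.compare(epoch, other.epoch);   // newer incarnation wins
            return Long.compare(clock, other.clock);       // same epoch: larger clock trumps
        }
    }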


The code change should be mostly localized to CounterColumn.reconcile(),
although, if an update does not find an existing entry in the memtable, we
need to go to the sstable to fetch any possible epoch number. So, compared
to the current write path, the "no replicate-on-write" case needs an extra
sstable read; in the "replicate-on-write" case we already do that read, so
there is no extra time cost. "No replicate-on-write" is not a very useful
setup in practice anyway.
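
To make rules 1) and 2) above concrete, here is a hedged sketch of the
delta-merge step during compaction (assumed types, not the real
CounterColumn.reconcile() signature): a delta from an older epoch is
dropped, and same-epoch deltas are summed with the larger clock kept.

    // Hypothetical sketch only; the real change would live in CounterColumn.reconcile().
    final class CounterDelta {
        final long epoch;
        final long clock;
        final long delta;

        CounterDelta(long epoch, long clock, long delta) {
            this.epoch = epoch;
            this.clock = clock;
            this.delta = delta;
        }

        static CounterDelta merge(CounterDelta a, CounterDelta b) {
            if (a.epoch > b.epoch)
                return a;                        // b belongs to a dead incarnation: ignore it
            if (b.epoch > a.epoch)
                return b;                        // a belongs to a dead incarnation: ignore it
            // same epoch: deltas commute, so sum them and keep the larger clock
            return new CounterDelta(a.epoch,
                                    Math.max(a.clock, b.clock),
                                    a.delta + b.delta);
        }
    }

With the example at the top of this mail, the epoch-0 "add 1" and the
epoch-1 "add 2" no longer merge into "add 3": the older delta is dropped,
so the delete at clock 200 stands.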


Does this sound like a feasible approach? If it works, expiring counters
should also naturally work.


Thanks
Yang

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
in "stronger reason", I mean the +3 is already merged up in memtable of node
B, you can't find +1 and +2 any more



On Tue, Jun 14, 2011 at 7:02 PM, Yang <te...@gmail.com> wrote:

> I almost got the code done, should release in a bit.
>
>
>
> your scenario is not a problem concerned with implementation, but really
> with definition of "same time". remember that in a distributed system, there
> is no absolute physical time concept, time is just another way of saying
> "before or after". in your scenario, since DCA and DCB are cut off, and
> there are no messages between them, you can NOT determine logically whether
> you should say the delete is before +3 or after it. you may say "hey, the
> timestamp I gave +3 is higher", but DCA may say:" your timestamp is just
> drifted, actually my delete happened later"
>
> in fact here is a stronger reason that you have to let go of the +3,
> because it might have already been merged up by +1 , which happened in
> physical time earlier than our DCA delete, and a +2 which happened after the
> DCA delete, now what would you say about whether the +3 is before or after
> our DCA delete? the only correct way to order them is to say:" sorry DCB:
> you missed the delete, all your latter +2 operations were just a snapshot
> earlier in time, the eventual result is the delete. ---- in other words, it
> is futile to update on a dead epoch while others have started a new one".
> this is the same dilemma that you face during sstable merging
>
> overall, I think it's easier to understand it if we realize that once you
> delete, all further edits on the counter is futile, epoch is another way of
> saying creating a completely new counter, the counter name we are using is
> just kind of an alias.
>
>
> yang
>
>
> On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> Who assigns those epoch numbers ?
>> You need all nodes to agree on the epoch number somehow to have this work,
>> but then how do you maintain those in a partition tolerant distributed
>> system ?
>>
>> I may have missed some parts of your proposal but let me consider a
>> scenario
>> that we have to be able to handle: consider two nodes A and B (RF=2) each
>> in
>> one data center (DCA and DCB) and a counter c. Suppose you do a +2
>> increment
>> on c that both nodes get. Now let say you have a network split and the
>> connection
>> between your 2 data center fails. In DCA you delete c, only A gets it.
>> In DCB, you
>> do more increments on c (say +3), only B gets it. The partition can
>> last for hours.
>> For deletion to work, we would need that whenever the network
>> partition is resolved,
>> both node eventually agree on the value 3 (i.e, only the second
>> increment).
>> I don't see how you could assign epoch numbers or anything to fix that.
>>
>> --
>> Sylvain
>>
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
>> > ok, I think it's better to understand it this way, then it is really
>> simple
>> > and intuitive:
>> > my proposed way of counter update can be simply seen as a combination of
>> > regular columns + current counter columns:
>> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
>> epoch
>> > number]
>> > then within each epoch, counter updates work as currently implemented
>> >
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
>> >>
>> >> I think this approach also works for your scenario:
>> >> I thought that the issue is only concerned with merging within the same
>> >> leader; but you pointed out
>> >> that a similar merging happens between leaders too, now I see that the
>> >> same rules on epoch number
>> >> also applies to inter-leader data merging, specifically in your case:
>> >>
>> >> everyone starts with epoch of 0, ( they should be same, if not, it also
>> >> works, we just consider them to be representing diffferent time
>> snapshots of
>> >> the same counter state)
>> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
>> >> node A      delete    clock:  0.200
>> >> node B     add 2     clock:  0.300
>> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A
>> has
>> >> already produced a delete, with epoch of 0, so A considers epoch 0
>> already
>> >> ended, it won't accept any replicated state with epoch < 1.
>> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
>> >> updates its future expected epoch to 1.
>> >> at this time, the state of system is:
>> >> node A     expected epoch =1  [A:nil] [B:nil]
>> >> same for node B
>> >>
>> >>
>> >> let's say we have following further writes:
>> >> node B  add 3  clock  1.400
>> >> node A adds 4  clock 1.500
>> >> node B receives A's add 4,   node B updates its copy of A
>> >> node A receives B's add 3,    updates its copy of B
>> >>
>> >> then state is:
>> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
>> >> node B same
>> >>
>> >>
>> >> generally I think it should be complete if we add the following rule
>> for
>> >> inter-leader replication:
>> >> each leader keeps a var in memory (and also persist to sstable when
>> >> flushing)  expected_epoch , initially set to 0
>> >> node P does:
>> >> on receiving updates from  node Q
>> >>         if Q.expected_epoch > P.expected_epoch
>> >>               /** an epoch bump inherently means a previous delete,
>> which
>> >> we probably missed , so we need to apply the delete
>> >>                   a delete is global to all leaders, so apply it on all
>> my
>> >> replicas **/
>> >>              for all leaders in my vector
>> >>                   count = nil
>> >>
>> >>              P.expected_epoch =  Q.expected_epoch
>> >>         if Q.expected_epoch == P.expected_epoch
>> >>              update P's copy of Q according to standard rules
>> >>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is
>> less
>> >> up to date than us, just ignore
>> >>
>> >> replicate_on_write(to Q):
>> >>       if  P.operation == delete
>> >>             P.expected_epoch ++
>> >>             set all my copies of all leaders to nil
>> >>       send to Q ( P.total , P.expected_epoch)
>> >>
>> >>
>> >>
>> >> overall I don't think delete being not commutative is a fundamental
>> >> blocker : regular columns are also not commutative, yet we achieve
>> stable
>> >> result no matter what order they are applied, because of the ordering
>> rule
>> >> used in reconciliation; here we just need to find a similar ordering
>> rule.
>> >> the epoch thing could be a step on this direction.
>> >>
>> >> Thanks
>> >> Yang
>> >>
>> >>
>> >>
>> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>> >>>
>> >>> I don't think that's bulletproof either.  For instance, what if the
>> >>> two adds go to replica 1 but the delete to replica 2?
>> >>>
>> >>> Bottom line (and this was discussed on the original
>> >>> delete-for-counters ticket,
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
>> deletes
>> >>> are not fully commutative which makes them fragile.
>> >>>
>> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
>> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>> >>> > indicates, the problem with counter delete is  in scenarios like the
>> >>> > following:
>> >>> > add 1, clock 100
>> >>> > delete , clock 200
>> >>> > add  2 , clock 300
>> >>> > if the 1st and 3rd operations are merged in SStable compaction, then
>> we
>> >>> > have
>> >>> > delete  clock 200
>> >>> > add 3,  clock 300
>> >>> > which shows wrong result.
>> >>> >
>> >>> > I think a relatively simple extension can be used to complete fix
>> this
>> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
>> >>> > clock,
>> >>> > so that
>> >>> >    1) a delete operation increases future epoch number by 1
>> >>> >    2) merging of delta adds can be between only deltas of the same
>> >>> > epoch,
>> >>> > deltas of older epoch are simply ignored during merging. merged
>> result
>> >>> > keeps
>> >>> > the epoch number of the newest seen.
>> >>> > other operations remain the same as current. note that the above 2
>> >>> > rules are
>> >>> > only concerned with merging within the deltas on the leader, and not
>> >>> > related
>> >>> > to the replicated count, which is a simple final state, and observes
>> >>> > the
>> >>> > rule of "larger clock trumps". naturally the ordering rule is:
>> >>> > epoch1.clock1
>> >>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
>> >>> >> clock2
>> >>> > intuitively "epoch" can be seen as the serial number on a new
>> >>> > "incarnation"
>> >>> > of a counter.
>> >>> >
>> >>> > code change should be mostly localized to CounterColumn.reconcile(),
>> >>> >  although, if an update does not find existing entry in memtable, we
>> >>> > need to
>> >>> > go to sstable to fetch any possible epoch number, so
>> >>> > compared to current write path, in the "no replicate-on-write" case,
>> we
>> >>> > need
>> >>> > to add a read to sstable. but in the "replicate-on-write" case, we
>> >>> > already
>> >>> > read that, so it's no extra time cost.  "no replicate-on-write" is
>> not
>> >>> > a
>> >>> > very useful setup in reality anyway.
>> >>> >
>> >>> > does this sound a feasible way?   if this works, expiring counter
>> >>> > should
>> >>> > also naturally work.
>> >>> >
>> >>> > Thanks
>> >>> > Yang
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of DataStax, the source for professional Cassandra support
>> >>> http://www.datastax.com
>> >>
>> >
>> >
>>
>
>

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
I almost have the code done; I should release it in a bit.



Your scenario is not a problem of implementation, but really of the
definition of "same time". Remember that in a distributed system there is no
absolute physical time; time is just another way of saying "before or
after". In your scenario, since DCA and DCB are cut off and there are no
messages between them, you can NOT logically determine whether the delete is
before the +3 or after it. You may say "hey, the timestamp I gave the +3 is
higher", but DCA may say "your timestamp has just drifted; actually my
delete happened later".

In fact, here is a stronger reason why you have to let go of the +3: it
might already have been merged from a +1 that happened, in physical time,
earlier than our DCA delete, and a +2 that happened after the DCA delete.
Now what would you say about whether the +3 is before or after our DCA
delete? The only correct way to order them is to say: "sorry DCB, you missed
the delete; all your later +2 operations were just a snapshot from an
earlier time, and the eventual result is the delete. In other words, it is
futile to update a dead epoch while others have started a new one." This is
the same dilemma you face during sstable merging.

Overall, I think it's easier to understand once we realize that after a
delete, all further edits on the counter are futile; the epoch is another
way of saying we created a completely new counter, and the counter name we
are using is just a kind of alias.
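
To make this concrete, here is a small self-contained simulation
(hypothetical classes, nothing from the Cassandra code base) of the DCA/DCB
scenario under these epoch rules: when the split heals, the epoch-0 "+3" is
let go and both sides converge on the delete.

    public final class EpochScenario {
        static final class Replica {
            long epoch = 0;
            long value = 0;

            void add(long d) { value += d; }
            void delete()    { epoch++; value = 0; }   // start a new incarnation

            // merge replicated state from the other data center
            void merge(Replica other) {
                if (other.epoch > epoch) {              // we missed a delete: wipe and adopt
                    epoch = other.epoch;
                    value = other.value;
                } else if (other.epoch == epoch) {
                    value = Math.max(value, other.value);  // per-leader merging simplified here
                }
                // other.epoch < epoch: stale incarnation, ignore
            }
        }

        public static void main(String[] args) {
            Replica a = new Replica(), b = new Replica();
            a.add(2); b.add(2);       // the +2 reaches both before the split
            a.delete();               // DCA: delete, only A sees it (epoch -> 1)
            b.add(3);                 // DCB: +3, only B sees it (still epoch 0)

            a.merge(b); b.merge(a);   // the partition heals
            System.out.println(a.epoch + ":" + a.value + "  " + b.epoch + ":" + b.value);
            // prints "1:0  1:0" -- the +3 belonged to the dead epoch and is dropped
        }
    }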


yang


On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sy...@datastax.com>wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time
> snapshots of
> >> the same counter state)
> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
> >> node A      delete    clock:  0.200
> >> node B     add 2     clock:  0.300
> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A
> has
> >> already produced a delete, with epoch of 0, so A considers epoch 0
> already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A     expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,    updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >>         if Q.expected_epoch > P.expected_epoch
> >>               /** an epoch bump inherently means a previous delete,
> which
> >> we probably missed , so we need to apply the delete
> >>                   a delete is global to all leaders, so apply it on all
> my
> >> replicas **/
> >>              for all leaders in my vector
> >>                   count = nil
> >>
> >>              P.expected_epoch =  Q.expected_epoch
> >>         if Q.expected_epoch == P.expected_epoch
> >>              update P's copy of Q according to standard rules
> >>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is
> less
> >> up to date than us, just ignore
> >>
> >> replicate_on_write(to Q):
> >>       if  P.operation == delete
> >>             P.expected_epoch ++
> >>             set all my copies of all leaders to nil
> >>       send to Q ( P.total , P.expected_epoch)
> >>
> >>
> >>
> >> overall I don't think delete being not commutative is a fundamental
> >> blocker : regular columns are also not commutative, yet we achieve
> stable
> >> result no matter what order they are applied, because of the ordering
> rule
> >> used in reconciliation; here we just need to find a similar ordering
> rule.
> >> the epoch thing could be a step on this direction.
> >>
> >> Thanks
> >> Yang
> >>
> >>
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>>
> >>> I don't think that's bulletproof either.  For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> >>> > indicates, the problem with counter delete is  in scenarios like the
> >>> > following:
> >>> > add 1, clock 100
> >>> > delete , clock 200
> >>> > add  2 , clock 300
> >>> > if the 1st and 3rd operations are merged in SStable compaction, then
> we
> >>> > have
> >>> > delete  clock 200
> >>> > add 3,  clock 300
> >>> > which shows wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to complete fix
> this
> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
> >>> > clock,
> >>> > so that
> >>> >    1) a delete operation increases future epoch number by 1
> >>> >    2) merging of delta adds can be between only deltas of the same
> >>> > epoch,
> >>> > deltas of older epoch are simply ignored during merging. merged
> result
> >>> > keeps
> >>> > the epoch number of the newest seen.
> >>> > other operations remain the same as current. note that the above 2
> >>> > rules are
> >>> > only concerned with merging within the deltas on the leader, and not
> >>> > related
> >>> > to the replicated count, which is a simple final state, and observes
> >>> > the
> >>> > rule of "larger clock trumps". naturally the ordering rule is:
> >>> > epoch1.clock1
> >>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
> >>> >> clock2
> >>> > intuitively "epoch" can be seen as the serial number on a new
> >>> > "incarnation"
> >>> > of a counter.
> >>> >
> >>> > code change should be mostly localized to CounterColumn.reconcile(),
> >>> >  although, if an update does not find existing entry in memtable, we
> >>> > need to
> >>> > go to sstable to fetch any possible epoch number, so
> >>> > compared to current write path, in the "no replicate-on-write" case,
> we
> >>> > need
> >>> > to add a read to sstable. but in the "replicate-on-write" case, we
> >>> > already
> >>> > read that, so it's no extra time cost.  "no replicate-on-write" is
> not
> >>> > a
> >>> > very useful setup in reality anyway.
> >>> >
> >>> > does this sound a feasible way?   if this works, expiring counter
> >>> > should
> >>> > also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
> >
>

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
Patch is in https://issues.apache.org/jira/browse/CASSANDRA-2774

Some of the code is messy and intended for demonstration only; we can refine
it once we agree this is a feasible way to go.


Thanks
Yang

On Tue, Jun 14, 2011 at 11:21 AM, Sylvain Lebresne <sy...@datastax.com>wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time
> snapshots of
> >> the same counter state)
> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
> >> node A      delete    clock:  0.200
> >> node B     add 2     clock:  0.300
> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A
> has
> >> already produced a delete, with epoch of 0, so A considers epoch 0
> already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A     expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,    updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >>         if Q.expected_epoch > P.expected_epoch
> >>               /** an epoch bump inherently means a previous delete,
> which
> >> we probably missed , so we need to apply the delete
> >>                   a delete is global to all leaders, so apply it on all
> my
> >> replicas **/
> >>              for all leaders in my vector
> >>                   count = nil
> >>
> >>              P.expected_epoch =  Q.expected_epoch
> >>         if Q.expected_epoch == P.expected_epoch
> >>              update P's copy of Q according to standard rules
> >>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is
> less
> >> up to date than us, just ignore
> >>
> >> replicate_on_write(to Q):
> >>       if  P.operation == delete
> >>             P.expected_epoch ++
> >>             set all my copies of all leaders to nil
> >>       send to Q ( P.total , P.expected_epoch)
> >>
> >>
> >>
> >> overall I don't think delete being not commutative is a fundamental
> >> blocker : regular columns are also not commutative, yet we achieve
> stable
> >> result no matter what order they are applied, because of the ordering
> rule
> >> used in reconciliation; here we just need to find a similar ordering
> rule.
> >> the epoch thing could be a step on this direction.
> >>
> >> Thanks
> >> Yang
> >>
> >>
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>>
> >>> I don't think that's bulletproof either.  For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> >>> > indicates, the problem with counter delete is  in scenarios like the
> >>> > following:
> >>> > add 1, clock 100
> >>> > delete , clock 200
> >>> > add  2 , clock 300
> >>> > if the 1st and 3rd operations are merged in SStable compaction, then
> we
> >>> > have
> >>> > delete  clock 200
> >>> > add 3,  clock 300
> >>> > which shows wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to complete fix
> this
> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
> >>> > clock,
> >>> > so that
> >>> >    1) a delete operation increases future epoch number by 1
> >>> >    2) merging of delta adds can be between only deltas of the same
> >>> > epoch,
> >>> > deltas of older epoch are simply ignored during merging. merged
> result
> >>> > keeps
> >>> > the epoch number of the newest seen.
> >>> > other operations remain the same as current. note that the above 2
> >>> > rules are
> >>> > only concerned with merging within the deltas on the leader, and not
> >>> > related
> >>> > to the replicated count, which is a simple final state, and observes
> >>> > the
> >>> > rule of "larger clock trumps". naturally the ordering rule is:
> >>> > epoch1.clock1
> >>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
> >>> >> clock2
> >>> > intuitively "epoch" can be seen as the serial number on a new
> >>> > "incarnation"
> >>> > of a counter.
> >>> >
> >>> > code change should be mostly localized to CounterColumn.reconcile(),
> >>> >  although, if an update does not find existing entry in memtable, we
> >>> > need to
> >>> > go to sstable to fetch any possible epoch number, so
> >>> > compared to current write path, in the "no replicate-on-write" case,
> we
> >>> > need
> >>> > to add a read to sstable. but in the "replicate-on-write" case, we
> >>> > already
> >>> > read that, so it's no extra time cost.  "no replicate-on-write" is
> not
> >>> > a
> >>> > very useful setup in reality anyway.
> >>> >
> >>> > does this sound a feasible way?   if this works, expiring counter
> >>> > should
> >>> > also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
> >
>

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
Yes, the epoch is generated by each node in the replica set upon a delete
operation.

The epoch is **global** to the replica set for one counter, in contrast to
the clock, which is local to a partition. Different counters have different
epoch numbers, because different counters can be seen as completely
different state machines; you can view the nodes in the replica set as
acting as a separate node for each counter, i.e. there are effectively
millions of replica sets, one per counter.

In fact we already have the epoch concept here, in timestampOfLastDelete,
but the latter is used in the wrong way: it should never be compared to
timestamp().
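
A minimal sketch of what "per counter, global to the replica set" could look
like (the names here are illustrative only, not the CounterColumn
internals); note that an epoch is only ever compared with another epoch,
never with an update's clock or timestamp:

    import java.util.HashMap;
    import java.util.Map;

    final class PerCounterEpochs {
        private final Map<String, Long> epochByCounter = new HashMap<>();

        long epochOf(String counterName) {
            return epochByCounter.getOrDefault(counterName, 0L);
        }

        // a delete of one counter bumps only that counter's epoch
        void onDelete(String counterName) {
            epochByCounter.merge(counterName, 1L, Long::sum);
        }

        // replication: adopt the higher epoch (epochs are compared with epochs only)
        void onReplicatedEpoch(String counterName, long remoteEpoch) {
            epochByCounter.merge(counterName, remoteEpoch, Math::max);
        }
    }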




On Tue, Jun 14, 2011 at 12:26 PM, Milind Parikh <mi...@gmail.com>wrote:

> If I understand this correctly, then the epoch integer would be
> generated by each node. Since time always flows forward, the assumption
> would be, I suppose, that the epochs would be tagged with the node that
> generated them and additionally the counter would carry as much history as
> necessary (and presumably not all history at all times).
>
> Milind
>
>
> On Tue, Jun 14, 2011 at 2:21 PM, Sylvain Lebresne <sy...@datastax.com>wrote:
>
>> Who assigns those epoch numbers ?
>> You need all nodes to agree on the epoch number somehow to have this work,
>> but then how do you maintain those in a partition tolerant distributed
>> system ?
>>
>> I may have missed some parts of your proposal but let me consider a
>> scenario
>> that we have to be able to handle: consider two nodes A and B (RF=2) each
>> in
>> one data center (DCA and DCB) and a counter c. Suppose you do a +2
>> increment
>> on c that both nodes get. Now let say you have a network split and the
>> connection
>> between your 2 data center fails. In DCA you delete c, only A gets it.
>> In DCB, you
>> do more increments on c (say +3), only B gets it. The partition can
>> last for hours.
>> For deletion to work, we would need that whenever the network
>> partition is resolved,
>> both node eventually agree on the value 3 (i.e, only the second
>> increment).
>> I don't see how you could assign epoch numbers or anything to fix that.
>>
>> --
>> Sylvain
>>
>> On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
>> > ok, I think it's better to understand it this way, then it is really
>> simple
>> > and intuitive:
>> > my proposed way of counter update can be simply seen as a combination of
>> > regular columns + current counter columns:
>> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
>> epoch
>> > number]
>> > then within each epoch, counter updates work as currently implemented
>> >
>> >
>> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
>> >>
>> >> I think this approach also works for your scenario:
>> >> I thought that the issue is only concerned with merging within the same
>> >> leader; but you pointed out
>> >> that a similar merging happens between leaders too, now I see that the
>> >> same rules on epoch number
>> >> also applies to inter-leader data merging, specifically in your case:
>> >>
>> >> everyone starts with epoch of 0, ( they should be same, if not, it also
>> >> works, we just consider them to be representing diffferent time
>> snapshots of
>> >> the same counter state)
>> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
>> >> node A      delete    clock:  0.200
>> >> node B     add 2     clock:  0.300
>> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A
>> has
>> >> already produced a delete, with epoch of 0, so A considers epoch 0
>> already
>> >> ended, it won't accept any replicated state with epoch < 1.
>> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
>> >> updates its future expected epoch to 1.
>> >> at this time, the state of system is:
>> >> node A     expected epoch =1  [A:nil] [B:nil]
>> >> same for node B
>> >>
>> >>
>> >> let's say we have following further writes:
>> >> node B  add 3  clock  1.400
>> >> node A adds 4  clock 1.500
>> >> node B receives A's add 4,   node B updates its copy of A
>> >> node A receives B's add 3,    updates its copy of B
>> >>
>> >> then state is:
>> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
>> >> node B same
>> >>
>> >>
>> >> generally I think it should be complete if we add the following rule
>> for
>> >> inter-leader replication:
>> >> each leader keeps a var in memory (and also persist to sstable when
>> >> flushing)  expected_epoch , initially set to 0
>> >> node P does:
>> >> on receiving updates from  node Q
>> >>         if Q.expected_epoch > P.expected_epoch
>> >>               /** an epoch bump inherently means a previous delete,
>> which
>> >> we probably missed , so we need to apply the delete
>> >>                   a delete is global to all leaders, so apply it on all
>> my
>> >> replicas **/
>> >>              for all leaders in my vector
>> >>                   count = nil
>> >>
>> >>              P.expected_epoch =  Q.expected_epoch
>> >>         if Q.expected_epoch == P.expected_epoch
>> >>              update P's copy of Q according to standard rules
>> >>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is
>> less
>> >> up to date than us, just ignore
>> >>
>> >> replicate_on_write(to Q):
>> >>       if  P.operation == delete
>> >>             P.expected_epoch ++
>> >>             set all my copies of all leaders to nil
>> >>       send to Q ( P.total , P.expected_epoch)
>> >>
>> >>
>> >>
>> >> overall I don't think delete being not commutative is a fundamental
>> >> blocker : regular columns are also not commutative, yet we achieve
>> stable
>> >> result no matter what order they are applied, because of the ordering
>> rule
>> >> used in reconciliation; here we just need to find a similar ordering
>> rule.
>> >> the epoch thing could be a step on this direction.
>> >>
>> >> Thanks
>> >> Yang
>> >>
>> >>
>> >>
>> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>> >>>
>> >>> I don't think that's bulletproof either.  For instance, what if the
>> >>> two adds go to replica 1 but the delete to replica 2?
>> >>>
>> >>> Bottom line (and this was discussed on the original
>> >>> delete-for-counters ticket,
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter
>> deletes
>> >>> are not fully commutative which makes them fragile.
>> >>>
>> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
>> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>> >>> > indicates, the problem with counter delete is  in scenarios like the
>> >>> > following:
>> >>> > add 1, clock 100
>> >>> > delete , clock 200
>> >>> > add  2 , clock 300
>> >>> > if the 1st and 3rd operations are merged in SStable compaction, then
>> we
>> >>> > have
>> >>> > delete  clock 200
>> >>> > add 3,  clock 300
>> >>> > which shows wrong result.
>> >>> >
>> >>> > I think a relatively simple extension can be used to complete fix
>> this
>> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
>> >>> > clock,
>> >>> > so that
>> >>> >    1) a delete operation increases future epoch number by 1
>> >>> >    2) merging of delta adds can be between only deltas of the same
>> >>> > epoch,
>> >>> > deltas of older epoch are simply ignored during merging. merged
>> result
>> >>> > keeps
>> >>> > the epoch number of the newest seen.
>> >>> > other operations remain the same as current. note that the above 2
>> >>> > rules are
>> >>> > only concerned with merging within the deltas on the leader, and not
>> >>> > related
>> >>> > to the replicated count, which is a simple final state, and observes
>> >>> > the
>> >>> > rule of "larger clock trumps". naturally the ordering rule is:
>> >>> > epoch1.clock1
>> >>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
>> >>> >> clock2
>> >>> > intuitively "epoch" can be seen as the serial number on a new
>> >>> > "incarnation"
>> >>> > of a counter.
>> >>> >
>> >>> > code change should be mostly localized to CounterColumn.reconcile(),
>> >>> >  although, if an update does not find existing entry in memtable, we
>> >>> > need to
>> >>> > go to sstable to fetch any possible epoch number, so
>> >>> > compared to current write path, in the "no replicate-on-write" case,
>> we
>> >>> > need
>> >>> > to add a read to sstable. but in the "replicate-on-write" case, we
>> >>> > already
>> >>> > read that, so it's no extra time cost.  "no replicate-on-write" is
>> not
>> >>> > a
>> >>> > very useful setup in reality anyway.
>> >>> >
>> >>> > does this sound a feasible way?   if this works, expiring counter
>> >>> > should
>> >>> > also naturally work.
>> >>> >
>> >>> > Thanks
>> >>> > Yang
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Jonathan Ellis
>> >>> Project Chair, Apache Cassandra
>> >>> co-founder of DataStax, the source for professional Cassandra support
>> >>> http://www.datastax.com
>> >>
>> >
>> >
>>
>
>

Re: one way to make counter delete work better

Posted by Milind Parikh <mi...@gmail.com>.
If I understand this correctly, then the epoch integer would be generated by
each node. Since time always flows forward, the assumption would be, I
suppose, that the epochs would be tagged with the node that generated them
and additionally the counter would carry as much history as necessary (and
presumably not all history at all times).

Milind


On Tue, Jun 14, 2011 at 2:21 PM, Sylvain Lebresne <sy...@datastax.com>wrote:

> Who assigns those epoch numbers ?
> You need all nodes to agree on the epoch number somehow to have this work,
> but then how do you maintain those in a partition tolerant distributed
> system ?
>
> I may have missed some parts of your proposal but let me consider a
> scenario
> that we have to be able to handle: consider two nodes A and B (RF=2) each
> in
> one data center (DCA and DCB) and a counter c. Suppose you do a +2
> increment
> on c that both nodes get. Now let say you have a network split and the
> connection
> between your 2 data center fails. In DCA you delete c, only A gets it.
> In DCB, you
> do more increments on c (say +3), only B gets it. The partition can
> last for hours.
> For deletion to work, we would need that whenever the network
> partition is resolved,
> both node eventually agree on the value 3 (i.e, only the second increment).
> I don't see how you could assign epoch numbers or anything to fix that.
>
> --
> Sylvain
>
> On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
> > ok, I think it's better to understand it this way, then it is really
> simple
> > and intuitive:
> > my proposed way of counter update can be simply seen as a combination of
> > regular columns + current counter columns:
> > regular column :  [ value: "wipes out every bucket to nil"   , clock:
> epoch
> > number]
> > then within each epoch, counter updates work as currently implemented
> >
> >
> > On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
> >>
> >> I think this approach also works for your scenario:
> >> I thought that the issue is only concerned with merging within the same
> >> leader; but you pointed out
> >> that a similar merging happens between leaders too, now I see that the
> >> same rules on epoch number
> >> also applies to inter-leader data merging, specifically in your case:
> >>
> >> everyone starts with epoch of 0, ( they should be same, if not, it also
> >> works, we just consider them to be representing diffferent time
> snapshots of
> >> the same counter state)
> >> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
> >> node A      delete    clock:  0.200
> >> node B     add 2     clock:  0.300
> >> node A    gets B's state:  add 2 clock 0.300, but rejects it because A
> has
> >> already produced a delete, with epoch of 0, so A considers epoch 0
> already
> >> ended, it won't accept any replicated state with epoch < 1.
> >> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> >> updates its future expected epoch to 1.
> >> at this time, the state of system is:
> >> node A     expected epoch =1  [A:nil] [B:nil]
> >> same for node B
> >>
> >>
> >> let's say we have following further writes:
> >> node B  add 3  clock  1.400
> >> node A adds 4  clock 1.500
> >> node B receives A's add 4,   node B updates its copy of A
> >> node A receives B's add 3,    updates its copy of B
> >>
> >> then state is:
> >> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
> >> node B same
> >>
> >>
> >> generally I think it should be complete if we add the following rule for
> >> inter-leader replication:
> >> each leader keeps a var in memory (and also persist to sstable when
> >> flushing)  expected_epoch , initially set to 0
> >> node P does:
> >> on receiving updates from  node Q
> >>         if Q.expected_epoch > P.expected_epoch
> >>               /** an epoch bump inherently means a previous delete,
> which
> >> we probably missed , so we need to apply the delete
> >>                   a delete is global to all leaders, so apply it on all
> my
> >> replicas **/
> >>              for all leaders in my vector
> >>                   count = nil
> >>
> >>              P.expected_epoch =  Q.expected_epoch
> >>         if Q.expected_epoch == P.expected_epoch
> >>              update P's copy of Q according to standard rules
> >>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is
> less
> >> up to date than us, just ignore
> >>
> >> replicate_on_write(to Q):
> >>       if  P.operation == delete
> >>             P.expected_epoch ++
> >>             set all my copies of all leaders to nil
> >>       send to Q ( P.total , P.expected_epoch)
> >>
> >>
> >>
> >> overall I don't think delete being not commutative is a fundamental
> >> blocker : regular columns are also not commutative, yet we achieve
> stable
> >> result no matter what order they are applied, because of the ordering
> rule
> >> used in reconciliation; here we just need to find a similar ordering
> rule.
> >> the epoch thing could be a step on this direction.
> >>
> >> Thanks
> >> Yang
> >>
> >>
> >>
> >> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>>
> >>> I don't think that's bulletproof either.  For instance, what if the
> >>> two adds go to replica 1 but the delete to replica 2?
> >>>
> >>> Bottom line (and this was discussed on the original
> >>> delete-for-counters ticket,
> >>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> >>> are not fully commutative which makes them fragile.
> >>>
> >>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
> >>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> >>> > indicates, the problem with counter delete is  in scenarios like the
> >>> > following:
> >>> > add 1, clock 100
> >>> > delete , clock 200
> >>> > add  2 , clock 300
> >>> > if the 1st and 3rd operations are merged in SStable compaction, then
> we
> >>> > have
> >>> > delete  clock 200
> >>> > add 3,  clock 300
> >>> > which shows wrong result.
> >>> >
> >>> > I think a relatively simple extension can be used to complete fix
> this
> >>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
> >>> > clock,
> >>> > so that
> >>> >    1) a delete operation increases future epoch number by 1
> >>> >    2) merging of delta adds can be between only deltas of the same
> >>> > epoch,
> >>> > deltas of older epoch are simply ignored during merging. merged
> result
> >>> > keeps
> >>> > the epoch number of the newest seen.
> >>> > other operations remain the same as current. note that the above 2
> >>> > rules are
> >>> > only concerned with merging within the deltas on the leader, and not
> >>> > related
> >>> > to the replicated count, which is a simple final state, and observes
> >>> > the
> >>> > rule of "larger clock trumps". naturally the ordering rule is:
> >>> > epoch1.clock1
> >>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
> >>> >> clock2
> >>> > intuitively "epoch" can be seen as the serial number on a new
> >>> > "incarnation"
> >>> > of a counter.
> >>> >
> >>> > code change should be mostly localized to CounterColumn.reconcile(),
> >>> >  although, if an update does not find existing entry in memtable, we
> >>> > need to
> >>> > go to sstable to fetch any possible epoch number, so
> >>> > compared to current write path, in the "no replicate-on-write" case,
> we
> >>> > need
> >>> > to add a read to sstable. but in the "replicate-on-write" case, we
> >>> > already
> >>> > read that, so it's no extra time cost.  "no replicate-on-write" is
> not
> >>> > a
> >>> > very useful setup in reality anyway.
> >>> >
> >>> > does this sound a feasible way?   if this works, expiring counter
> >>> > should
> >>> > also naturally work.
> >>> >
> >>> > Thanks
> >>> > Yang
> >>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of DataStax, the source for professional Cassandra support
> >>> http://www.datastax.com
> >>
> >
> >
>

Re: one way to make counter delete work better

Posted by Sylvain Lebresne <sy...@datastax.com>.
Who assigns those epoch numbers?
You need all nodes to agree on the epoch number somehow for this to work,
but then how do you maintain that in a partition-tolerant distributed system?

I may have missed some parts of your proposal, but let me consider a
scenario that we have to be able to handle: consider two nodes A and B
(RF=2), each in one data center (DCA and DCB), and a counter c. Suppose you
do a +2 increment on c that both nodes get. Now let's say you have a network
split and the connection between your 2 data centers fails. In DCA you
delete c; only A gets it. In DCB, you do more increments on c (say +3); only
B gets it. The partition can last for hours. For deletion to work, we would
need that whenever the network partition is resolved, both nodes eventually
agree on the value 3 (i.e., only the second increment). I don't see how you
could assign epoch numbers or anything else to fix that.

--
Sylvain

On Mon, Jun 13, 2011 at 8:26 PM, Yang <te...@gmail.com> wrote:
> ok, I think it's better to understand it this way, then it is really simple
> and intuitive:
> my proposed way of counter update can be simply seen as a combination of
> regular columns + current counter columns:
> regular column :  [ value: "wipes out every bucket to nil"   , clock: epoch
> number]
> then within each epoch, counter updates work as currently implemented
>
>
> On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:
>>
>> I think this approach also works for your scenario:
>> I thought that the issue is only concerned with merging within the same
>> leader; but you pointed out
>> that a similar merging happens between leaders too, now I see that the
>> same rules on epoch number
>> also applies to inter-leader data merging, specifically in your case:
>>
>> everyone starts with epoch of 0, ( they should be same, if not, it also
>> works, we just consider them to be representing diffferent time snapshots of
>> the same counter state)
>> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
>> node A      delete    clock:  0.200
>> node B     add 2     clock:  0.300
>> node A    gets B's state:  add 2 clock 0.300, but rejects it because A has
>> already produced a delete, with epoch of 0, so A considers epoch 0 already
>> ended, it won't accept any replicated state with epoch < 1.
>> node B    gets A's delete  0.200,  it zeros its own count of "2", and
>> updates its future expected epoch to 1.
>> at this time, the state of system is:
>> node A     expected epoch =1  [A:nil] [B:nil]
>> same for node B
>>
>>
>> let's say we have following further writes:
>> node B  add 3  clock  1.400
>> node A adds 4  clock 1.500
>> node B receives A's add 4,   node B updates its copy of A
>> node A receives B's add 3,    updates its copy of B
>>
>> then state is:
>> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
>> node B same
>>
>>
>> generally I think it should be complete if we add the following rule for
>> inter-leader replication:
>> each leader keeps a var in memory (and also persist to sstable when
>> flushing)  expected_epoch , initially set to 0
>> node P does:
>> on receiving updates from  node Q
>>         if Q.expected_epoch > P.expected_epoch
>>               /** an epoch bump inherently means a previous delete, which
>> we probably missed , so we need to apply the delete
>>                   a delete is global to all leaders, so apply it on all my
>> replicas **/
>>              for all leaders in my vector
>>                   count = nil
>>
>>              P.expected_epoch =  Q.expected_epoch
>>         if Q.expected_epoch == P.expected_epoch
>>              update P's copy of Q according to standard rules
>>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is less
>> up to date than us, just ignore
>>
>> replicate_on_write(to Q):
>>       if  P.operation == delete
>>             P.expected_epoch ++
>>             set all my copies of all leaders to nil
>>       send to Q ( P.total , P.expected_epoch)
>>
>>
>>
>> overall I don't think delete being not commutative is a fundamental
>> blocker : regular columns are also not commutative, yet we achieve stable
>> result no matter what order they are applied, because of the ordering rule
>> used in reconciliation; here we just need to find a similar ordering rule.
>> the epoch thing could be a step on this direction.
>>
>> Thanks
>> Yang
>>
>>
>>
>> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>
>>> I don't think that's bulletproof either.  For instance, what if the
>>> two adds go to replica 1 but the delete to replica 2?
>>>
>>> Bottom line (and this was discussed on the original
>>> delete-for-counters ticket,
>>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
>>> are not fully commutative which makes them fragile.
>>>
>>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
>>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>>> > indicates, the problem with counter delete is  in scenarios like the
>>> > following:
>>> > add 1, clock 100
>>> > delete , clock 200
>>> > add  2 , clock 300
>>> > if the 1st and 3rd operations are merged in SStable compaction, then we
>>> > have
>>> > delete  clock 200
>>> > add 3,  clock 300
>>> > which shows wrong result.
>>> >
>>> > I think a relatively simple extension can be used to complete fix this
>>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
>>> > clock,
>>> > so that
>>> >    1) a delete operation increases future epoch number by 1
>>> >    2) merging of delta adds can be between only deltas of the same
>>> > epoch,
>>> > deltas of older epoch are simply ignored during merging. merged result
>>> > keeps
>>> > the epoch number of the newest seen.
>>> > other operations remain the same as current. note that the above 2
>>> > rules are
>>> > only concerned with merging within the deltas on the leader, and not
>>> > related
>>> > to the replicated count, which is a simple final state, and observes
>>> > the
>>> > rule of "larger clock trumps". naturally the ordering rule is:
>>> > epoch1.clock1
>>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
>>> >> clock2
>>> > intuitively "epoch" can be seen as the serial number on a new
>>> > "incarnation"
>>> > of a counter.
>>> >
>>> > code change should be mostly localized to CounterColumn.reconcile(),
>>> >  although, if an update does not find existing entry in memtable, we
>>> > need to
>>> > go to sstable to fetch any possible epoch number, so
>>> > compared to current write path, in the "no replicate-on-write" case, we
>>> > need
>>> > to add a read to sstable. but in the "replicate-on-write" case, we
>>> > already
>>> > read that, so it's no extra time cost.  "no replicate-on-write" is not
>>> > a
>>> > very useful setup in reality anyway.
>>> >
>>> > does this sound a feasible way?   if this works, expiring counter
>>> > should
>>> > also naturally work.
>>> >
>>> > Thanks
>>> > Yang
>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>
>
>

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
OK, I think it's better to understand it this way; then it is really simple
and intuitive:

My proposed way of counter updates can simply be seen as a combination of
regular columns + current counter columns:

regular column:  [ value: "wipes out every bucket to nil", clock: epoch number ]
Then within each epoch, counter updates work as currently implemented.
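
A minimal sketch of that combination (assumed names; the per-leader clock
handling is simplified, and these are not the actual column classes): the
epoch acts as the regular column's clock whose write wipes every bucket,
while within an epoch the per-leader shards behave as they do today.

    import java.util.HashMap;
    import java.util.Map;

    final class EpochedCounter {
        static final class Shard { long clock; long count; }

        long epoch = 0;                                     // the "regular column" clock
        final Map<String, Shard> shards = new HashMap<>();  // per-leader buckets within the epoch

        void delete() {
            epoch++;          // writing the "regular column" ...
            shards.clear();   // ... wipes every bucket to nil
        }

        // higher epoch wins outright; equal epochs merge shard by shard under
        // the usual "larger clock trumps" rule
        void reconcile(EpochedCounter other) {
            if (other.epoch < epoch)
                return;
            if (other.epoch > epoch) {
                epoch = other.epoch;
                shards.clear();
            }
            other.shards.forEach((leader, theirs) -> {
                Shard ours = shards.get(leader);
                if (ours == null || theirs.clock > ours.clock)
                    shards.put(leader, theirs);
            });
        }
    }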



On Mon, Jun 13, 2011 at 10:12 AM, Yang <te...@gmail.com> wrote:

> I think this approach also works for your scenario:
>
> I thought that the issue is only concerned with merging within the same
> leader; but you pointed out
> that a similar merging happens between leaders too, now I see that the same
> rules on epoch number
> also applies to inter-leader data merging, specifically in your case:
>
>
> everyone starts with epoch of 0, ( they should be same, if not, it also
> works, we just consider them to be representing diffferent time snapshots of
> the same counter state)
>
> node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)
>
> node A      delete    clock:  0.200
>
> node B     add 2     clock:  0.300
>
> node A    gets B's state:  add 2 clock 0.300, but rejects it because A has
> already produced a delete, with epoch of 0, so A considers epoch 0 already
> ended, it won't accept any replicated state with epoch < 1.
>
> node B    gets A's delete  0.200,  it zeros its own count of "2", and
> updates its future expected epoch to 1.
>
> at this time, the state of system is:
> node A     expected epoch =1  [A:nil] [B:nil]
> same for node B
>
>
>
> let's say we have following further writes:
>
> node B  add 3  clock  1.400
>
> node A adds 4  clock 1.500
>
> node B receives A's add 4,   node B updates its copy of A
> node A receives B's add 3,    updates its copy of B
>
>
> then state is:
> node A  , expected epoch == 1    [A:4  clock=400] [B:3   clock=500]
> node B same
>
>
>
> generally I think it should be complete if we add the following rule for
> inter-leader replication:
>
> each leader keeps a var in memory (and also persist to sstable when
> flushing)  expected_epoch , initially set to 0
>
> node P does:
> on receiving updates from  node Q
>         if Q.expected_epoch > P.expected_epoch
>               /** an epoch bump inherently means a previous delete, which
> we probably missed , so we need to apply the delete
>                   a delete is global to all leaders, so apply it on all my
> replicas **/
>              for all leaders in my vector
>                   count = nil
>
>              P.expected_epoch =  Q.expected_epoch
>         if Q.expected_epoch == P.expected_epoch
>              update P's copy of Q according to standard rules
>         /** if Q.expected_epoch < P.expected_epoch  , that means Q is less
> up to date than us, just ignore
>
>
> replicate_on_write(to Q):
>       if  P.operation == delete
>             P.expected_epoch ++
>             set all my copies of all leaders to nil
>       send to Q ( P.total , P.expected_epoch)
>
>
>
>
> overall I don't think delete being not commutative is a fundamental blocker
> : regular columns are also not commutative, yet we achieve stable result no
> matter what order they are applied, because of the ordering rule used in
> reconciliation; here we just need to find a similar ordering rule. the epoch
> thing could be a step on this direction.
>
>
> Thanks
> Yang
>
>
>
>
> On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> I don't think that's bulletproof either.  For instance, what if the
>> two adds go to replica 1 but the delete to replica 2?
>>
>> Bottom line (and this was discussed on the original
>> delete-for-counters ticket,
>> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
>> are not fully commutative which makes them fragile.
>>
>> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
>> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
>> > indicates, the problem with counter delete is  in scenarios like the
>> > following:
>> > add 1, clock 100
>> > delete , clock 200
>> > add  2 , clock 300
>> > if the 1st and 3rd operations are merged in SStable compaction, then we
>> > have
>> > delete  clock 200
>> > add 3,  clock 300
>> > which shows wrong result.
>> >
>> > I think a relatively simple extension can be used to complete fix this
>> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
>> clock,
>> > so that
>> >    1) a delete operation increases future epoch number by 1
>> >    2) merging of delta adds can be between only deltas of the same
>> epoch,
>> > deltas of older epoch are simply ignored during merging. merged result
>> keeps
>> > the epoch number of the newest seen.
>> > other operations remain the same as current. note that the above 2 rules
>> are
>> > only concerned with merging within the deltas on the leader, and not
>> related
>> > to the replicated count, which is a simple final state, and observes the
>> > rule of "larger clock trumps". naturally the ordering rule is:
>> epoch1.clock1
>> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
>> clock2
>> > intuitively "epoch" can be seen as the serial number on a new
>> "incarnation"
>> > of a counter.
>> >
>> > code change should be mostly localized to CounterColumn.reconcile(),
>> >  although, if an update does not find existing entry in memtable, we
>> need to
>> > go to sstable to fetch any possible epoch number, so
>> > compared to current write path, in the "no replicate-on-write" case, we
>> need
>> > to add a read to sstable. but in the "replicate-on-write" case, we
>> already
>> > read that, so it's no extra time cost.  "no replicate-on-write" is not a
>> > very useful setup in reality anyway.
>> >
>> > does this sound a feasible way?   if this works, expiring counter should
>> > also naturally work.
>> >
>> > Thanks
>> > Yang
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>

Re: one way to make counter delete work better

Posted by Yang <te...@gmail.com>.
I think this approach also works for your scenario:

I thought the issue was only concerned with merging within the same
leader; but you pointed out
that a similar merging happens between leaders too. Now I see that the same
rules on epoch numbers
also apply to inter-leader data merging; specifically, in your case:


everyone starts with an epoch of 0 (they should be the same; if not, it still
works, we just consider them to represent different time snapshots of
the same counter state)

node A      add 1    clock:  0.100  (epoch = 0, clock number = 100)

node A      delete    clock:  0.200

node B     add 2     clock:  0.300

node A    gets B's state (add 2, clock 0.300) but rejects it, because A has
already produced a delete with epoch 0; A considers epoch 0 already
ended and won't accept any replicated state with epoch < 1.

node B    gets A's delete (clock 0.200); it zeros its own count of 2 and
updates its expected epoch to 1.

at this time, the state of the system is:
node A     expected epoch =1  [A:nil] [B:nil]
same for node B



let's say we have the following further writes:

node B  add 3  clock  1.400

node A adds 4  clock 1.500

node B receives A's add 4 and updates its copy of A
node A receives B's add 3 and updates its copy of B


then the state is:
node A, expected epoch == 1    [A:4  clock=500] [B:3  clock=400]
same for node B



generally I think it should be complete if we add the following rules for
inter-leader replication:

each leader keeps a variable, expected_epoch, in memory (and also persists it
to the sstable when flushing), initially set to 0

node P does:
on receiving updates from node Q:
        if Q.expected_epoch > P.expected_epoch
              /** an epoch bump inherently means a previous delete, which we
                  probably missed, so we need to apply the delete;
                  a delete is global to all leaders, so apply it to all my
                  replicas **/
             for all leaders in my vector
                  count = nil

             P.expected_epoch = Q.expected_epoch
        if Q.expected_epoch == P.expected_epoch
             update P's copy of Q according to standard rules
        /** if Q.expected_epoch < P.expected_epoch, that means Q is less
            up to date than us; just ignore **/


replicate_on_write(to Q):
      if P.operation == delete
            P.expected_epoch ++
            set all my copies of all leaders to nil
      send to Q ( P.total , P.expected_epoch )
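
to make the above less hand-wavy, here is a rough Java sketch of the two
rules (just an illustration with made-up names, a simplified per-counter
state object, not the real CounterColumn / CounterContext classes):

import java.util.HashMap;
import java.util.Map;

// illustrative sketch of the inter-leader rules above; all names are made up
public class EpochCounterState
{
    long expectedEpoch = 0;                                    // bumped on every delete
    final Map<String, Long> perLeaderCount = new HashMap<>();  // leader id -> its count

    // on receiving a replicated update from leader Q
    void onReceive(String leaderQ, long qExpectedEpoch, long qCount)
    {
        if (qExpectedEpoch > expectedEpoch)
        {
            // an epoch bump implies a delete we missed: wipe every leader's copy
            perLeaderCount.clear();
            expectedEpoch = qExpectedEpoch;
        }
        if (qExpectedEpoch == expectedEpoch)
        {
            // standard rule within an epoch: take Q's latest count for Q's shard
            perLeaderCount.put(leaderQ, qCount);
        }
        // qExpectedEpoch < expectedEpoch: Q is behind us, ignore its state
    }

    // replicate-on-write from this node (leader P)
    ReplicationMessage onLocalWrite(String leaderP, boolean isDelete, long delta)
    {
        if (isDelete)
        {
            expectedEpoch++;
            perLeaderCount.clear();   // set all copies of all leaders to nil
        }
        else
        {
            perLeaderCount.merge(leaderP, delta, Long::sum);
        }
        long total = perLeaderCount.getOrDefault(leaderP, 0L);
        return new ReplicationMessage(leaderP, expectedEpoch, total);
    }

    record ReplicationMessage(String leader, long expectedEpoch, long total) {}
}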




overall I don't think deletes not being commutative is a fundamental blocker:
regular columns are not commutative either, yet we achieve a stable result no
matter what order they are applied in, because of the ordering rule used in
reconciliation; here we just need to find a similar ordering rule. the epoch
idea could be a step in this direction.
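
and to show why the epoch rule makes the merge order irrelevant for the
scenario in the original mail, a tiny sketch (again made-up names): the add
before the delete carries epoch 0, the add after it carries epoch 1, and the
older-epoch delta is simply dropped whichever way the two get merged:

// illustrative only; not Cassandra code
public class EpochMergeDemo
{
    record Delta(long epoch, long delta, long clock) {}

    // merge two deltas: the older epoch is ignored; within an epoch, sum the deltas
    static Delta merge(Delta a, Delta b)
    {
        if (a.epoch() != b.epoch())
            return a.epoch() > b.epoch() ? a : b;
        return new Delta(a.epoch(), a.delta() + b.delta(), Math.max(a.clock(), b.clock()));
    }

    public static void main(String[] args)
    {
        Delta add1 = new Delta(0, 1, 100);  // add 1, before the delete (epoch 0)
        Delta add2 = new Delta(1, 2, 300);  // add 2, after the delete bumped the epoch to 1

        System.out.println(merge(add1, add2).delta());  // 2
        System.out.println(merge(add2, add1).delta());  // 2
    }
}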


Thanks
Yang




On Mon, Jun 13, 2011 at 9:04 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> I don't think that's bulletproof either.  For instance, what if the
> two adds go to replica 1 but the delete to replica 2?
>
> Bottom line (and this was discussed on the original
> delete-for-counters ticket,
> https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
> are not fully commutative which makes them fragile.
>
> On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
> > as https://issues.apache.org/jira/browse/CASSANDRA-2101
> > indicates, the problem with counter delete is  in scenarios like the
> > following:
> > add 1, clock 100
> > delete , clock 200
> > add  2 , clock 300
> > if the 1st and 3rd operations are merged in SStable compaction, then we
> > have
> > delete  clock 200
> > add 3,  clock 300
> > which shows wrong result.
> >
> > I think a relatively simple extension can be used to complete fix this
> > issue: similar to ZooKeeper, we can prefix an "Epoch" number to the
> clock,
> > so that
> >    1) a delete operation increases future epoch number by 1
> >    2) merging of delta adds can be between only deltas of the same epoch,
> > deltas of older epoch are simply ignored during merging. merged result
> keeps
> > the epoch number of the newest seen.
> > other operations remain the same as current. note that the above 2 rules
> are
> > only concerned with merging within the deltas on the leader, and not
> related
> > to the replicated count, which is a simple final state, and observes the
> > rule of "larger clock trumps". naturally the ordering rule is:
> epoch1.clock1
> >> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 >
> clock2
> > intuitively "epoch" can be seen as the serial number on a new
> "incarnation"
> > of a counter.
> >
> > code change should be mostly localized to CounterColumn.reconcile(),
> >  although, if an update does not find existing entry in memtable, we need
> to
> > go to sstable to fetch any possible epoch number, so
> > compared to current write path, in the "no replicate-on-write" case, we
> need
> > to add a read to sstable. but in the "replicate-on-write" case, we
> already
> > read that, so it's no extra time cost.  "no replicate-on-write" is not a
> > very useful setup in reality anyway.
> >
> > does this sound a feasible way?   if this works, expiring counter should
> > also naturally work.
> >
> > Thanks
> > Yang
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: one way to make counter delete work better

Posted by Jonathan Ellis <jb...@gmail.com>.
I don't think that's bulletproof either.  For instance, what if the
two adds go to replica 1 but the delete to replica 2?

Bottom line (and this was discussed on the original
delete-for-counters ticket,
https://issues.apache.org/jira/browse/CASSANDRA-2101), counter deletes
are not fully commutative, which makes them fragile.
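
To make the order-sensitivity concrete, here is a toy sketch of the scenario
from the original mail (made-up names, not Cassandra code): with plain
last-write-wins clocks, the total you end up with depends on whether
compaction merges the two adds before or after the delete is applied.

// illustrative only; demonstrates that the merge is not order-independent
public class DeleteCommutativityDemo
{
    record Add(long delta, long clock) {}

    // merge two adds: sum the deltas, keep the newest clock
    static Add merge(Add a, Add b)
    {
        return new Add(a.delta() + b.delta(), Math.max(a.clock(), b.clock()));
    }

    // reconcile an add against a delete: the higher clock wins
    static long reconcile(Add add, long deleteClock)
    {
        return add.clock() > deleteClock ? add.delta() : 0;
    }

    public static void main(String[] args)
    {
        Add add1 = new Add(1, 100);
        Add add2 = new Add(2, 300);
        long deleteClock = 200;

        // compaction merges the adds first, then the delete is reconciled: prints 3 (wrong)
        System.out.println(reconcile(merge(add1, add2), deleteClock));

        // the delete is reconciled against each add separately: prints 0 + 2 = 2 (intended)
        System.out.println(reconcile(add1, deleteClock) + reconcile(add2, deleteClock));
    }
}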

On Mon, Jun 13, 2011 at 10:54 AM, Yang <te...@gmail.com> wrote:
> as https://issues.apache.org/jira/browse/CASSANDRA-2101
> indicates, the problem with counter delete is  in scenarios like the
> following:
> add 1, clock 100
> delete , clock 200
> add  2 , clock 300
> if the 1st and 3rd operations are merged in SStable compaction, then we
> have
> delete  clock 200
> add 3,  clock 300
> which shows wrong result.
>
> I think a relatively simple extension can be used to complete fix this
> issue: similar to ZooKeeper, we can prefix an "Epoch" number to the clock,
> so that
>    1) a delete operation increases future epoch number by 1
>    2) merging of delta adds can be between only deltas of the same epoch,
> deltas of older epoch are simply ignored during merging. merged result keeps
> the epoch number of the newest seen.
> other operations remain the same as current. note that the above 2 rules are
> only concerned with merging within the deltas on the leader, and not related
> to the replicated count, which is a simple final state, and observes the
> rule of "larger clock trumps". naturally the ordering rule is: epoch1.clock1
>> epoch2.clock2  iff epoch1 > epoch2 || epoch1 == epoch2 && clock1 > clock2
> intuitively "epoch" can be seen as the serial number on a new "incarnation"
> of a counter.
>
> code change should be mostly localized to CounterColumn.reconcile(),
>  although, if an update does not find existing entry in memtable, we need to
> go to sstable to fetch any possible epoch number, so
> compared to current write path, in the "no replicate-on-write" case, we need
> to add a read to sstable. but in the "replicate-on-write" case, we already
> read that, so it's no extra time cost.  "no replicate-on-write" is not a
> very useful setup in reality anyway.
>
> does this sound a feasible way?   if this works, expiring counter should
> also naturally work.
>
> Thanks
> Yang



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com