Posted to user@cassandra.apache.org by Yang <te...@gmail.com> on 2015/08/03 04:30:04 UTC

linearizable consistency / Paxos ?

This link
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
talks about linearizable consistency and lightweight transactions.

But I am still not completely following it based on the article alone.

The standard replication protocol in Cassandra does establish a total order
(based on the client timestamp, even though that can be wrong/unfair). So in
the case of the example mentioned in the article, "if 2 people try to create
the same account", yes: if both of them just brute-force write, we will
ultimately have one winner, namely whoever provided the higher timestamp (and
this is resolved consistently across all replicas).
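
For concreteness, the per-cell decision is essentially the following (a toy
Java sketch, not Cassandra's actual reconciliation code; the tie-break rule
is just an assumption for illustration):

    // Every replica applies the same deterministic rule, so they all
    // converge on the same winner without any coordination.
    static String reconcile(String valueA, long timestampA,
                            String valueB, long timestampB) {
        if (timestampA != timestampB) {
            return timestampA > timestampB ? valueA : valueB;  // higher TS wins
        }
        // Equal timestamps: break the tie deterministically (e.g. by value)
        // so that replicas still agree on a single winner.
        return valueA.compareTo(valueB) >= 0 ? valueA : valueB;
    }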

What really matters in the above situation is the ability to group the two
operations "check existing account" and "create account" together and run
them atomically, so we need something like a 2-phase commit.
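
In Cassandra 2.0 this "check then create" pair is exactly what the article's
IF NOT EXISTS insert expresses as a single compare-and-set. A minimal sketch
with the DataStax Java driver (assuming a recent 2.x/3.x driver; the
keyspace, table and column names are made up):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class CreateAccountCas {
        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {
                // IF NOT EXISTS turns the insert into a Paxos-backed
                // compare-and-set: the "does the account already exist" check
                // and the write are decided together in one round.
                ResultSet rs = session.execute(
                    "INSERT INTO accounts (username, email) " +
                    "VALUES ('yang', 'yang@example.com') IF NOT EXISTS");
                // false means another client created the account first.
                System.out.println("created: " + rs.wasApplied());
            }
        }
    }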

I guess what is not clear from the article is: what is the fundamental
difference between the standard replication protocol and Paxos that prevents
us from implementing a 2-pc on top of the standard protocol?

Thanks!
yang

Re: linearizable consistency / Paxos ?

Posted by Yang <te...@gmail.com>.
Thanks a lot for the info!


I see, the Paxos protocol used in the code now is actually the "single-decree
Synod" protocol, which votes on only one value.

The scope of the implementation is only the CAS operation (which contains a
read and a write), not a generic transaction (which could contain arbitrarily
many operations).

For a generic transaction, the multi-decree protocol would be needed. Here
CAS is able to work on top of the single-decree Synod because the read is
essentially "sandwiched"/bounded between the prepare/promise and the
propose/accept phases, so that no other ballot can get in between (the line
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L273
checks this).
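
Roughly, the flow would look like this (a hypothetical sketch with made-up
helper names and result types, not the real StorageProxy code):

    // Coordinator-side sketch of a single CAS over the single-decree Synod;
    // prepare(), readRow(), propose(), commit() and the types are invented
    // placeholders, not Cassandra's actual API.
    CasResult cas(PartitionKey key, Condition condition, Mutation update) {
        Ballot ballot = nextBallot();              // monotonically increasing
        if (!prepare(key, ballot)) {               // phase 1: prepare / promise
            return CasResult.retry();              // a higher ballot is active
        }
        Row current = readRow(key);                // quorum read under the promise
        if (!condition.appliesTo(current)) {
            return CasResult.notApplied(current);  // e.g. account already exists
        }
        if (!propose(key, ballot, update)) {       // phase 2: propose / accept
            return CasResult.retry();              // another proposal got in first
        }
        commit(key, ballot, update);               // phase 3: commit / apply
        return CasResult.applied();
    }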

Re: linearizable consistency / Paxos ?

Posted by DuyHai Doan <do...@gmail.com>.
"you seem to be suggesting that the "other operations on the same partition
key have to wait" because Paxos grouped the first series together, which have
to be committed in the same order, before all other operations, essentially
___serializing___ the operations (with a guaranteed same order)."
--> No, the implementation does not group Paxos operations together. And when
I mentioned (INSERT, UPDATE, DELETE ...) I did not mean a group of operations,
just individual INSERT or UPDATE or DELETE operations, each of which can occur
at any moment.

Indeed, there are three scenarios for dueling proposals P1 and P2:

1. P1 has not been accepted yet and P2 has a higher ballot than P1: then P1
will abort, sleep for a random amount of time, and re-propose later. This
gives P2 a chance to complete its Paxos round.

2. P1 has been accepted (the Propose/Accept phase succeeded) and P2 has a
higher ballot than P1: then the coordinator that issued P2 has to commit P1
first before re-proposing P2.

3. P2 has a lower ballot than P1: then P2 is rejected at the Prepare/Promise
phase.
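
In pseudo-Java, one contended round looks roughly like this from a
coordinator's point of view (the helper names and the PrepareResponse shape
are invented for illustration, not Cassandra's actual internals):

    PrepareResponse resp = prepare(key, myBallot); // phase 1: prepare / promise
    if (!resp.promised) {
        // A competing proposal holds a higher ballot (cases 1 and 3 above,
        // seen from the losing side): back off randomly and re-propose.
        backOffRandomly();
        return retryWithNewBallot();
    }
    if (resp.inProgressProposal != null) {
        // Case 2: an earlier proposal was accepted but never committed;
        // commit it first, then start a fresh round for our own proposal.
        commit(key, resp.inProgressProposal);
        return retryWithNewBallot();
    }
    // No contention: carry on with the read and the propose/accept phases.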

"I guess Cassandra must be doing something to prevent "the second guy
injecting his operation before DELETE" in the above scenario" --> Since there
is no grouping of Paxos operations (not to be confused with a BATCH statement,
which is executed as a single Paxos operation!), C* does nothing to prevent a
second client from issuing a Paxos operation between the UPDATE and the
DELETE.
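
Concretely, two conditional statements issued back-to-back by the same client
are two independent Paxos rounds; nothing ties them together (sketch assuming
an open Java-driver Session and made-up table/column names):

    // Each conditional statement is its own Paxos round on the partition.
    session.execute("UPDATE accounts SET email = 'a@example.com' " +
                    "WHERE username = 'yang' IF EXISTS");
    // <-- another client's conditional statement on the same partition can
    //     be proposed and committed right here
    session.execute("DELETE FROM accounts WHERE username = 'yang' IF EXISTS");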

If you're interested, you can look into the source code here:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageProxy.java#L202

The Javadoc is also worth reading because it briefly explains the semantics.



Re: linearizable consistency / Paxos ?

Posted by Yang <te...@gmail.com>.
Thanks for your answer, DuyHai.

I understand Paxos, but I think your description is missing one important
point: in the example you gave, "a series of ongoing operations (INSERT,
UPDATE, DELETE ...)", you seem to be suggesting that the "other operations on
the same partition key have to wait" because Paxos grouped the first series
together, which have to be committed in the same order, before all other
operations, essentially ___serializing___ the operations (with a guaranteed
same order).

But Paxos ONLY guarantees the order of the operations as they are proposed;
Paxos itself cannot control when an operation is proposed. For example, in
the above sequence INSERT, UPDATE, DELETE, ..., the second guy is fully
allowed to propose his operation (say another UPDATE) before the DELETE is
proposed, and hence get a ballot number earlier (smaller) than the one the
DELETE will get, so the final committed sequence is INSERT, UPDATE,
op_from_another_guy, DELETE, ...

I guess Cassandra must be doing something to prevent "the second guy
injecting his operation before DELETE" in the above scenario; that seems to
be some transaction manager which is not yet clearly described in the slides
you gave.

If that is correct, my point is: if we let the above transaction manager work
with the standard replication protocol, don't we also get transactional
behavior?


Re: linearizable consistency / Paxos ?

Posted by DuyHai Doan <do...@gmail.com>.
"what is the fundamental difference between the standard replication
protocol and Paxos that prevents us from implementing a 2-pc on top of the
standard protocol?"

--> for a more detailed description of Paxos, look here:
http://www.slideshare.net/doanduyhai/distributed-algorithms-for-big-data-geecon/41

Long story short: with Paxos, when there is an ongoing operation (INSERT,
UPDATE, DELETE, ...) on a particular partition key, any other concurrent
operation on the same partition key has to wait until the ongoing operation
commits.

If the ongoing operation is validated by Paxos but fails before being able
to commit (after the accept phase in the diagram), then any subsequent
operation on this partition key will commit this stalled operation before
starting its own.
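
At the client level you only ever see the serialized outcome. For example,
with the DataStax Java driver (made-up table and column names), a conditional
UPDATE that is not applied comes back with the value that actually won the
Paxos round, which may be a replayed stalled commit:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class ConditionalUpdate {
        public static void main(String[] args) throws Exception {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {
                // One Paxos round: check the condition and apply the update.
                ResultSet rs = session.execute(
                    "UPDATE accounts SET email = 'new@example.com' " +
                    "WHERE username = 'yang' IF email = 'old@example.com'");
                if (!rs.wasApplied()) {
                    // Condition failed: the result row carries the current
                    // value, i.e. whatever the winning operation committed.
                    Row row = rs.one();
                    System.out.println("current email: "
                                       + row.getString("email"));
                }
            }
        }
    }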


