You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Branimir Lambov (Jira)" <ji...@apache.org> on 2022/01/20 08:46:00 UTC

[jira] [Comment Edited] (CASSANDRA-17164) CEP-14: Paxos Improvements

    [ https://issues.apache.org/jira/browse/CASSANDRA-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479167#comment-17479167 ] 

Branimir Lambov edited comment on CASSANDRA-17164 at 1/20/22, 8:45 AM:
-----------------------------------------------------------------------

{quote}I think this is actually the typical way of implementing Classic Paxos, even though Lamport's paper seems to suggest you must only contact the nodes that responded to the prepare (there may be something else specific about his formulation that necessitates this, I forget, as I dislike his writings on the topic). This is corroborated by Heidi Howard's dissertation, which was the easiest place I could find a straight-forward formulation of Classic Paxos besides that of Lamport. See Algorithm 3 on Page 30.
{quote}
This is an excellent pointer, thank you. Lamport's proof also works with separate proposal quorum – my only request here is that this should be stated somewhere as other contributors might start with Lamport's original definition too.
{quote}I don't quite follow what you mean by this, as this is not limited to "most recent commit", but a ballot directly maps to the instance id of classic paxos, it just avoids pre-splitting the range of integers.
{quote}
Well, I don't follow this one. A ballot id is something one replica selects and sends it as a prepare message to everyone. At this point some nodes may have committed a proposal, others may have accepted it, and yet others may have never heard of it. From the point of view of a sequence of paxos instances, the first two will definitely make promises for different instances, and likely the third one will be promising for yet another one. The fact that we tie accepting a promise to a majority with matching "most recent commit" means that we effectively treat it as the identifier of the current session.

More interestingly, we can then send a proposal to nodes with older most recent commit (i.e. nodes that are effectively working on a previous paxos instance), that proposal will be accepted (overwriting the decided value but not advancing the most recent commit), and this might lead to a decision using just a small number of participants with the right "most recent commit". Specifically, what stops the following from happening?
{code:java}
                       A mrc    A accepted    A promised             B mrc   B accepted   B promised             C mrc    C accepted   C promised

prepare(1) -> ABC        -           -            1                    -           -           1                   -          -            1
propose(1, X) -> ABC     -         1,X            1                    -         1,X           1                   -         1,X           1
commit(1, X) -> AB      1,X          -            1                   1,X          -           1                   -          -            1

prepare(2) -> AB        1,X          -            2                   1,X          -           2                   -         1,X           1                      majority AB, no refresh
propose(2, Y) -> BC     1,X          -            2                   1,X        2,Y           2                   -         2,Y           2                      returns successful Y

prepare(3) -> AC        1,X        1,X            3                   1,X        2,Y           2                  1,X         -            3                      refresh stale, then forms majority AC
propose(3, Z) -> ABC    1,X        3,Z            3                   1,X        3,Z           3                  1,X        3,Z           3                      returns successful Z   
{code}
If this is permitted, there's a further error possibility if B is partitioned after the 2,Y proposal succeeds and a commit is executed in B. Then the committed value can be read from B, and later wiped with a conflicting decision.

In LWT terms, if 
X: {{{}column = X if not exists{}}}, 
Y: {{column = Y if column == X}} and 
Z: {{{}column = Z if column == X{}}},
we may be able to read (serially) both Y and Z.
{quote}Could you explain what you are referring to here? I think this is all standard stuff for Paxos, we're again just recording the most recently used instance number for each register.
{quote}
Judging from the above, this may be insufficient. I'm likely missing something here – that something needs to be documented.
{quote}The final commit phase is only required to ensure any "decree" (decision) is disseminated. If we have proposed that no decree be made, there is nothing to disseminate, and nothing to complete if another transaction encounters it. This is in some ways an artefact of the feature of Cassandra's implementation, that we initiate a paxos round without knowing if it will do anything, though this feature would I suppose be present for read-only operations anyway.
{quote}
This is also hard to read from the code: {{Paxos.cas}} does not issue a commit for an empty proposal, but it seems the next {{prepare}} will return {{FOUND_INCOMPLETE_ACCEPTED}} and trigger reproposal and commit. This will work correctly, but perform one roundtrip more than simply committing when the agreement was reached. What am I missing?


was (Author: blambov):
bq. I think this is actually the typical way of implementing Classic Paxos, even though Lamport's paper seems to suggest you must only contact the nodes that responded to the prepare (there may be something else specific about his formulation that necessitates this, I forget, as I dislike his writings on the topic). This is corroborated by Heidi Howard's dissertation, which was the easiest place I could find a straight-forward formulation of Classic Paxos besides that of Lamport. See Algorithm 3 on Page 30.

This is an excellent pointer, thank you. Lamport's proof also works with separate proposal quorum -- my only request here is that this should be stated somewhere as other contributors might start with Lamport's original definition too.

bq. I don't quite follow what you mean by this, as this is not limited to "most recent commit", but a ballot directly maps to the instance id of classic paxos, it just avoids pre-splitting the range of integers.

Well, I don't follow this one. A ballot id is something one replica selects and sends it as a prepare message to everyone. At this point some nodes may have committed a proposal, others may have accepted it, and yet others may have never heard of it. From the point of view of a sequence of paxos instances, the first two will definitely make promises for different instances, and likely the third one will be promising for yet another one. The fact that we tie accepting a promise to a majority with matching "most recent commit" means that we effectively treat it as the identifier of the current session.

More interestingly, we can then send a proposal to nodes with older most recent commit (i.e. nodes that are effectively working on a previous paxos instance), that proposal will be accepted (overwriting the decided value but not advancing the most recent commit), and this might lead to a decision using just a small number of participants with the right "most recent commit". Specifically, what stops the following from happening?

{code}
                       A mrc    A accepted    A promised             B mrc   B accepted   B promised             C mrc    C accepted   C promised

prepare(1) -> ABC        -           -            1                    -           -           1                   -          -            1
propose(1, X) -> ABC     -         1,X            1                    -         1,X           1                   -         1,X           1
commit(1, X) -> AB      1,X          -            1                   1,X          -           1                   -          -            1

prepare(2) -> AB        1,X          -            2                   1,X          -           2                   -         1,X           1                      majority AB, no refresh
propose(2, Y) -> BC     1,X          -            2                   1,X        2,Y           2                   -         2,Y           2                      returns successful Y

prepare(3) -> AC        1,X        1,X            3                   1,X        2,Y           2                  1,X         -            3                      refresh stale, then forms majority AC
propose(3, Z) -> ABC    1,X        3,Z            3                   1,X        3,Z           3                  1,X        3,Z           3                      returns successful Z   
{code}

If this is permitted, there's a further error possibility if B is partitioned after the 2,Y proposal succeeds and a commit is executed in B. Then the committed value can be read from B, and later wiped with a conflicting decision. 

In LWT terms, if 
X: {{column = X if not exists}}, 
Y: {{column = Y if column == X}} and 
Z: {{column = Z if column == X}},
we may be able to read both Y and Z.

bq. Could you explain what you are referring to here? I think this is all standard stuff for Paxos, we're again just recording the most recently used instance number for each register.

Judging from the above, this may be insufficient. I'm likely missing something here -- that something needs to be documented.

bq. The final commit phase is only required to ensure any "decree" (decision) is disseminated. If we have proposed that no decree be made, there is nothing to disseminate, and nothing to complete if another transaction encounters it. This is in some ways an artefact of the feature of Cassandra's implementation, that we initiate a paxos round without knowing if it will do anything, though this feature would I suppose be present for read-only operations anyway.

This is also hard to read from the code: {{Paxos.cas}} does not issue a commit for an empty proposal, but it seems the next {{prepare}} will return {{FOUND_INCOMPLETE_ACCEPTED}} and trigger reproposal and commit. This will work correctly, but perform one roundtrip more than simply committing when the agreement was reached. What am I missing?

> CEP-14: Paxos Improvements
> --------------------------
>
>                 Key: CASSANDRA-17164
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17164
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination, Consistency/Repair
>            Reporter: Benedict Elliott Smith
>            Assignee: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.1
>
>
> This ticket encompasses work for [CEP-14|
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org