Posted to user@cassandra.apache.org by Nicholas Wilson <ni...@realvnc.com> on 2016/02/25 10:23:49 UTC

Handling uncommitted paxos state

Hi,

I have some questions about the behaviour of 'uncommitted paxos state', as described here:

http://www.datastax.com/dev/blog/cassandra-error-handling-done-right

If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS write, that means that the paxos phase was successful, but the data couldn't be committed during the final 'commit/reset' phase. On the next SERIAL write or read, any other node can commit the write on behalf of the original proposer, and in fact must do so before forming a new ballot. This stops the columns from getting 'stuck' if the coordinator experiences a network partition after forming the ballot, but before committing.
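
For illustration, here is a minimal sketch of how a client might tell the two timeout cases apart. It is not driver code: WriteType, CasTimeoutOutcome and classify_cas_timeout are stand-in names defined only for this example, and the reading of the CAS case (paxos phase itself timed out, so the outcome is unknown) is my understanding of the linked blog post rather than something stated in this thread.

```python
from enum import Enum


class WriteType(Enum):
    # Stand-in for the driver's write types; only the two relevant values.
    CAS = "CAS"
    SIMPLE = "SIMPLE"


class CasTimeoutOutcome(Enum):
    # Paxos phase timed out: the write may or may not end up applied.
    UNKNOWN = "unknown"
    # Paxos phase succeeded, commit timed out: the next SERIAL
    # read or write on the partition is expected to finish the commit.
    PENDING_COMMIT = "pending_commit"


def classify_cas_timeout(write_type: WriteType) -> CasTimeoutOutcome:
    """Map the write type reported by a CAS write timeout to an outcome."""
    if write_type is WriteType.SIMPLE:
        return CasTimeoutOutcome.PENDING_COMMIT
    if write_type is WriteType.CAS:
        return CasTimeoutOutcome.UNKNOWN
    raise ValueError(f"unexpected write type for a CAS write: {write_type}")


if __name__ == "__main__":
    for wt in (WriteType.SIMPLE, WriteType.CAS):
        print(wt.value, "->", classify_cas_timeout(wt).value)
```

In the PENDING_COMMIT case a client that needs to know the final value could issue a SERIAL read of the partition, which (as described above) has to complete the outstanding commit first.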

My questions are on the durability of the uncommitted state:

Suppose CAS writes are infrequent, and it takes weeks before another write is done to that column; will the paxos state still be there, waiting forever until the next commit, or does it get automatically committed during GC if you wait long enough? (I don't see how it could be cleaned up by a GC though, since the nodes holding the paxos state don't know if the ballot was won or not.)

Or, what if all the nodes are switched off (briefly); is the uncommitted paxos state persisted to disk in the log/journal, so the write can still be completed when the cluster comes back online?

Finally, how granular is the paxos state? Will the uncommitted write be completed on the next SERIAL write that touches the same exact combination of cells, or is it per-column across the partition, or....? If the CAS write touches two or three cells in the row, will a subsequent SERIAL read from any one of those three columns complete the uncommitted state, presumably on the other columns as well?

Thanks for your help,
Nick

---
Nick Wilson
Software engineer, RealVNC

Re: Handling uncommitted paxos state

Posted by Robert Coli <rc...@eventbrite.com>.

On Thu, Feb 25, 2016 at 1:23 AM, Nicholas Wilson <
nicholas.wilson@realvnc.com> wrote:

> If a WriteTimeoutException with WriteType.SIMPLE is thrown for a CAS
> write, that means that the paxos phase was successful, but the data
> couldn't be committed during the final 'commit/reset' phase. On the next
> SERIAL write or read, any other node can commit the write on behalf of the
> original proposer, and must do so in fact before forming a new ballot. This
> stops the columns from getting 'stuck' if the coordinator experiences a
> network partition after forming the ballot, but before committing.
>

If you're asking these questions, you probably want to read :

https://issues.apache.org/jira/browse/CASSANDRA-9328

=Rob

Re: Handling uncommitted paxos state

Posted by Carl Yeksigian <ca...@yeksigian.com>.

The paxos state is written to a system table (system.paxos) on each of the
replicas taking part in the paxos round, so it goes through the normal write
path, including persisting to the commit log and being stored in a memtable
until being flushed to disk. As such, the state can survive restarts. These
states are not treated differently from our normal memtables, so there isn't
any special handling for a GC.

There is no process which will come in and fix up the values; they are
fixed at a partition level when trying to perform a CAS operation, or when
reading at a SERIAL consistency. This operation happens at the partition,
so if any part of the partition is read or updated, it will finish previous
transactions.
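
To make the partition-level behaviour concrete, here is a toy in-memory model. It is only a sketch of the idea described above, not how Cassandra implements it: ToyReplica and its methods are hypothetical names, there are no ballots or quorums, and a single object stands in for the whole cluster.

```python
class ToyReplica:
    """Toy model: uncommitted paxos state is tracked per partition key."""

    def __init__(self):
        self.data = {}         # partition key -> {column: value}, committed
        self.in_progress = {}  # partition key -> {column: value}, accepted only

    def accept(self, partition_key, update):
        # Model a proposal that was accepted but whose commit timed out.
        self.in_progress[partition_key] = dict(update)

    def _finish_in_progress(self, partition_key):
        # Commit any leftover accepted update for this partition.
        pending = self.in_progress.pop(partition_key, None)
        if pending is not None:
            self.data.setdefault(partition_key, {}).update(pending)

    def serial_read(self, partition_key, column):
        # Any SERIAL read of the partition first finishes earlier work,
        # whichever column is being asked for.
        self._finish_in_progress(partition_key)
        return self.data.get(partition_key, {}).get(column)


if __name__ == "__main__":
    replica = ToyReplica()
    replica.accept("user:1", {"a": 1, "b": 2})
    # Reading an unrelated column still completes the pending write.
    replica.serial_read("user:1", "c")
    print(replica.data["user:1"])
```

The point of the model is that the pending state is keyed by partition only, so it does not matter which cells the original CAS write touched or which column the later SERIAL read asks for.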

If you want to know more,
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
has a lot more information about lightweight transactions.

-Carl
