Posted to commits@cassandra.apache.org by "Blake Eggleston (JIRA)" <ji...@apache.org> on 2014/12/01 20:52:13 UTC

[jira] [Commented] (CASSANDRA-6246) EPaxos

    [ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230331#comment-14230331 ] 

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

Since it looks like the performance improvements from epaxos could be worth the (substantial) added complexity, I’ve been thinking through the problems caused by the need to garbage collect instances, and by repair causing inconsistencies when it sends data from ‘the future’.

For repair, the only thing I’ve thought of that would work 100% of the time would be to count executed instances for a partition, and to send that count along with the repair request. If the remote count is higher than the local count, we know for sure that the remote node has data from the future, and the repair for that partition should be deferred.
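As a rough sketch of that deferral check (class and method names here are illustrative, not actual Cassandra APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class RepairCountCheck {
    // executed-instance counts per partition, maintained locally
    private final Map<String, Long> executionCounts = new HashMap<>();

    public void recordExecution(String partition) {
        executionCounts.merge(partition, 1L, Long::sum);
    }

    // A repair message would carry the sender's count. If the remote count
    // is higher, the remote node has executed instances we haven't yet
    // ("data from the future"), so repair for this partition must wait.
    public boolean shouldDeferRepair(String partition, long remoteCount) {
        long localCount = executionCounts.getOrDefault(partition, 0L);
        return remoteCount > localCount;
    }
}
```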

For garbage collection, we’ll need to support a failure recovery mode that works without all historical instances. We also need a way to quickly determine whether a prepare phase should be used, or we need an epaxos repair-type operation to bring a node up to speed.

Breaking the continuous execution space of partition ranges into discrete epochs would give us a relatively straightforward way of solving all of these problems. Each partition range will have its own epoch number. At a given instance number threshold, time threshold, or event, epaxos will run an epoch increment instance, which will take every active instance in its partition range as a dependency. Any instance executed before the epoch instance belongs to the last epoch; any instance executed after it belongs to the new one.
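A minimal sketch of that epoch bookkeeping, assuming an execution-count trigger (the class shape, names, and threshold are hypothetical; in reality the increment would itself be an epaxos instance submitted with the active set as dependencies):

```java
import java.util.HashSet;
import java.util.Set;

public class EpochTracker {
    private long currentEpoch = 0;
    private long executionsThisEpoch = 0;
    private final long executionThreshold; // assumed trigger condition
    private final Set<String> activeInstances = new HashSet<>();

    public EpochTracker(long executionThreshold) {
        this.executionThreshold = executionThreshold;
    }

    public void instanceStarted(String id) { activeInstances.add(id); }

    public void instanceExecuted(String id) {
        activeInstances.remove(id);
        if (++executionsThisEpoch >= executionThreshold)
            incrementEpoch();
    }

    // The epoch-increment instance depends on every active instance in the
    // range, so anything ordered before it lands in the old epoch and
    // anything ordered after it lands in the new one.
    private void incrementEpoch() {
        Set<String> dependencies = new HashSet<>(activeInstances);
        // ... here we would submit an epaxos instance carrying
        // `dependencies`, and bump the epoch once it executes ...
        currentEpoch++;
        executionsThisEpoch = 0;
    }

    public long epoch() { return currentEpoch; }
}
```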

How this would solve the outstanding problems:

Garbage Collection: Any instance from 2 or more epochs ago can be deleted. Although the epoch-increment instance doesn’t prevent dependencies on the previous epoch, it does prevent dependencies on any epoch before that.
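The deletion rule itself reduces to a simple epoch comparison (an illustrative predicate, not a real Cassandra API):

```java
public class InstanceGc {
    // An instance is safe to garbage collect once its epoch is at least
    // two behind the current epoch: nothing in the current epoch can
    // still depend on it.
    public static boolean canDelete(long instanceEpoch, long currentEpoch) {
        return currentEpoch - instanceEpoch >= 2;
    }
}
```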

Repair: Counting executions allows us to determine if repair data is from the future. Epochs let us scope execution counts to an epoch. If the epoch has incremented twice without new executions for a partition, the bookkeeping data for that partition can be deleted. This gives us a race-free way to delete old execution counts, so we don’t keep bookkeeping data around forever.
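One way to picture the epoch-scoped bookkeeping (again a hypothetical sketch; the data structures and names are mine, not the implementation's):

```java
import java.util.HashMap;
import java.util.Map;

public class ExecutionBookkeeping {
    // per-partition: epoch of the most recent execution, and the count
    private final Map<String, Long> lastExecutionEpoch = new HashMap<>();
    private final Map<String, Long> counts = new HashMap<>();

    public void recordExecution(String partition, long epoch) {
        lastExecutionEpoch.put(partition, epoch);
        counts.merge(partition, 1L, Long::sum);
    }

    // After two epoch increments with no executions for a partition, its
    // counts can no longer race with an in-flight repair, so they can go.
    public void pruneOnEpochIncrement(long newEpoch) {
        lastExecutionEpoch.entrySet().removeIf(e -> {
            if (newEpoch - e.getValue() >= 2) {
                counts.remove(e.getKey());
                return true;
            }
            return false;
        });
    }

    public Long count(String partition) { return counts.get(partition); }
}
```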

Failure recovery: Using epochs makes the choice between a prepare phase and failure recovery unambiguous. If a node is missing instances that are from 2 or more epochs ago, it will need to run a failure recovery; otherwise, prepare phases will work. Additionally, using an epaxos instance as the method of incrementing epochs guarantees that a given instance has been executed once the epoch has been incremented twice.
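That decision could be expressed as (illustrative names only):

```java
public class RecoveryDecision {
    public enum Mode { PREPARE, FAILURE_RECOVERY }

    // If the missing instance is 2+ epochs old, peers may already have
    // garbage collected it, so a prepare phase cannot be relied on and a
    // full failure recovery is needed instead.
    public static Mode choose(long missingInstanceEpoch, long currentEpoch) {
        return (currentEpoch - missingInstanceEpoch >= 2)
             ? Mode.FAILURE_RECOVERY
             : Mode.PREPARE;
    }
}
```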

> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Blake Eggleston
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is that Multi-paxos requires leader election and hence, a period of unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-Paxos, (2) is particularly useful across multiple datacenters, and (3) allows any node to act as coordinator: http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)