Posted to commits@cassandra.apache.org by "Blake Eggleston (JIRA)" <ji...@apache.org> on 2014/09/20 02:10:35 UTC

[jira] [Commented] (CASSANDRA-6246) EPaxos

    [ https://issues.apache.org/jira/browse/CASSANDRA-6246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14141590#comment-14141590 ] 

Blake Eggleston commented on CASSANDRA-6246:
--------------------------------------------

I'm still poring over the discussion in CASSANDRA-5062 and the current implementation, but wanted to expand on some of the advantages, list a few disadvantages and caveats of using Egalitarian Paxos, and talk about a few areas where we'd probably want to deviate from the process as described in the paper by Moraru et al.

Advantages:
* In the ideal case we should be able to answer a client's query after the same number of inter-node messages it takes to do a quorum write. (There will be more total messages, but we don't need to wait for them to complete before responding to the client.)
** This assumes that each node performs the CAS locally instead of using Paxos to set up a quorum read/write
* Even in the non-ideal case, you're still looking at 2 network round trips before reaching commit (the current implementation appears to use 4 network round trips for CAS?)
* Much higher throughput on interfering queries is possible. Multiple in-flight queries on the same row is not a problem.
** Livelock is not a risk during normal operation, only during failure recovery, and it can be mitigated by specifying an order of succession for query leaders. Of course, really heavy 'normal' operation might start triggering failure cases.
* Granular control over which operations interfere with each other
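To illustrate the last point, here is a minimal sketch (hypothetical types and names, not from the actual implementation) of what a per-partition interference check could look like. EPaxos only needs to order instances whose operations interfere; anything finer-grained than "same partition" would shrink dependency sets further:

```java
import java.util.Objects;

public class Interference {
    // A simplified serialized operation, identified by keyspace and partition key.
    static final class Op {
        final String keyspace;
        final String partitionKey;
        Op(String keyspace, String partitionKey) {
            this.keyspace = keyspace;
            this.partitionKey = partitionKey;
        }
    }

    // Two ops interfere if they could read or write the same partition.
    // Non-interfering instances can be committed and executed independently.
    static boolean interferes(Op a, Op b) {
        return Objects.equals(a.keyspace, b.keyspace)
            && Objects.equals(a.partitionKey, b.partitionKey);
    }

    public static void main(String[] args) {
        Op x = new Op("ks", "user:1");
        Op y = new Op("ks", "user:2");
        System.out.println(interferes(x, y)); // different partitions: no ordering needed
        System.out.println(interferes(x, x)); // same partition: must be ordered
    }
}
```

Multiple in-flight queries on different partitions never enter each other's dependency sets, which is where the throughput advantage over single-leader ordering comes from.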

Disadvantages:
* The EPaxos optimizations are possible because it has a fairly complex failure recovery procedure
* The concurrent programming side of things will be more complicated than the current implementation
* Because execution is more asynchronous than classic Paxos, I think we'd have to perform the operations locally rather than using Paxos to set up a normal quorum read/write. On one hand, this saves us a network round trip. On the other hand, if people are doing non-serialized writes at the same time as serialized writes that affect the same cells, different nodes will likely record different results for a query. Obviously, it's not a good idea to do this, but that doesn't mean people won't.

Caveats:
* With rf>3, or a non-replica coordinator, responses from more than a quorum of replicas _may_ be needed to commit in the ideal case. Alternatively, we could just use the 2-message commit path in those situations. I'm still working out the details, but I'm pretty sure there are failure scenarios where not doing so could result in different values being committed after recovery.
* EPaxos is pretty new. I was talking to the authors about it a few months ago, and the only implementations we were aware of were mine and theirs... I'm pretty sure there aren't any production deployments of it. That's not _necessarily_ a bad thing, but I just wanted to point out that we are in fairly new territory, and that should be weighed against the advantages. There is no 'Making EPaxos Live' paper out there.
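To make the first caveat concrete, here is a sketch of the quorum arithmetic, assuming N = 2F+1 replicas as in the paper (the basic protocol's fast quorum is 2F; the optimized variant in the paper uses F + floor((F+1)/2), both counting the leader):

```java
public class QuorumSizes {
    // Classic (slow-path) quorum: a simple majority.
    static int classicQuorum(int n) { return n / 2 + 1; }

    // Basic EPaxos fast-path quorum: 2F replicas, including the leader.
    static int fastQuorumBasic(int n) {
        int f = (n - 1) / 2;
        return 2 * f;
    }

    // Optimized fast-path quorum from the paper: F + floor((F+1)/2).
    static int fastQuorumOptimized(int n) {
        int f = (n - 1) / 2;
        return f + (f + 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5, 7}) {
            System.out.printf("N=%d classic=%d fastBasic=%d fastOptimized=%d%n",
                n, classicQuorum(n), fastQuorumBasic(n), fastQuorumOptimized(n));
        }
    }
}
```

For rf=3 all three sizes are 2, so the fast path costs nothing extra; at rf=5 the basic fast quorum is 4 while a simple quorum is 3, which is exactly the "more than a quorum" case above.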

Places where Cassandra's architecture will likely require doing things a bit differently than outlined in the paper:
* Sequence values will cause problems, but they shouldn't be necessary.
*# Since each node is responsible for different ranges of data, and has therefore seen different queries, encountering conflicting seq values would be very likely, resulting in a lot of otherwise unnecessary accept phases. We could get around this by using different seq values for different token ranges, but...
*# Since we'd wait until the query is actually executed before returning a result to the client (I don't see why we wouldn't), it's a superfluous requirement. I discussed this with Iulian Moraru a few months ago and he agreed.
* Using a non-replica coordinator:
*# The paper assumes that an instance leader is also a replica of the data being queried. I'd imagine we'd want to avoid optimistically forwarding queries to a single replica and hoping it's up, which means allowing coordinators to lead queries for keys they don't own. The downside is that non-leader replicas can't record that they agree with the leader, which rules out some failure-recovery optimizations. It would make a good case for using prepared statements and token-aware routing.
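On dropping seq values: a hedged sketch (hypothetical names, not Cassandra code) of what dependency tracking per partition key could look like instead of a global sequence counter. Only the most recent interfering instance per key needs to be a direct dependency, since earlier instances are reachable transitively through it:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

public class DepTracker {
    // Latest interfering instance(s) recorded per partition key.
    private final Map<String, Set<UUID>> latestByKey = new HashMap<>();

    // Returns the dependency set for a new instance on `key`, then records
    // that instance as the latest interfering one for the key. Execution
    // order is then derived from the dependency graph at execution time,
    // rather than from a seq value agreed on up front.
    public synchronized Set<UUID> registerInstance(String key, UUID instanceId) {
        Set<UUID> deps =
            new HashSet<>(latestByKey.getOrDefault(key, Collections.emptySet()));
        latestByKey.put(key, new HashSet<>(Collections.singleton(instanceId)));
        return deps;
    }
}
```

Instances on different keys get empty dependency sets and never force an accept phase on each other, which sidesteps the cross-token-range seq conflicts described above.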


> EPaxos
> ------
>
>                 Key: CASSANDRA-6246
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6246
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>
> One reason we haven't optimized our Paxos implementation with Multi-paxos is that Multi-paxos requires leader election and hence, a period of unavailability when the leader dies.
> EPaxos is a Paxos variant that (1) requires fewer messages than Multi-Paxos, (2) is particularly useful across multiple datacenters, and (3) allows any node to act as coordinator: http://sigops.org/sosp/sosp13/papers/p358-moraru.pdf
> However, there is substantial additional complexity involved if we choose to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)