Posted to commits@cassandra.apache.org by "Peter Schuller (Created) (JIRA)" <ji...@apache.org> on 2012/02/17 09:48:00 UTC

[jira] [Created] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

demystify failure detector, consider partial failure handling, latency optimizations
------------------------------------------------------------------------------------

                 Key: CASSANDRA-3927
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3927
             Project: Cassandra
          Issue Type: Wish
            Reporter: Peter Schuller
            Assignee: Peter Schuller
            Priority: Minor


[My aim with this ticket is to explain my current understanding of the FD and its behavior, express some opinions, and invite others to let me know if I'm misunderstanding something.]

So I was getting back to CASSANDRA-3569 and re-read the ticket history, and I want to add a few things about the FD in general that I've realized since the last batch of discussion.

Firstly, as a result of investigating gossip more, and of reading CASSANDRA-2597 (Paul Cannon's excellent break-down of what the FD actually does mathematically - thank you Paul!), I now have a much better understanding of the behavior of the failure detector than I did before. Unfortunately for the failure detector. Under the premise that the break-down in CASSANDRA-2597 (and the resulting change committed to Cassandra) is correct, and if we ignore all the Gaussian/normal distribution stuff (I openly admit I lack the necessary math background for a rigorous analysis), the behavior of the failure detector is the following (not an actual quote despite the quote syntax; this is my paraphrase):

{quote}
For a given node, keep track of the last 1000 intervals between heartbeats received via gossip (indirectly or directly from the node, it doesn't matter). At any given moment, the phi "score" of the node is the *time since the last heartbeat divided by the average time between heartbeats over the past 1000 intervals* (scaled by a constant factor which is essentially ignorable, since it is equivalent to a corresponding adjustment of the convict threshold). If it goes above a certain value, we consider the node down. We check for this on every gossip round (meaning about every second).
{quote}

I want to re-state the behavior like this because it makes it trivial to get an intuitive understanding of what it does without delving into the AFD paper.

In addition, consider that the failure detector will *immediately* consider a node up when it receives a new heartbeat, if the node is considered down at the time.

Further, while the accrual FD paper talks about giving a non-binary score, and we do compute one, we don't actually use it except to trigger a binary up/down flap.
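
To make that concrete, here is a minimal sketch of the behavior described above - this is NOT Cassandra's actual FailureDetector; the class/method names and the fixed 1000-interval window are assumptions for illustration only:

{code:java}
// Minimal sketch of the behavior described above (illustrative, not Cassandra code).
import java.util.ArrayDeque;
import java.util.Deque;

class NaivePhiDetector
{
    private static final int WINDOW = 1000;        // last N heartbeat intervals
    private final double convictThreshold;         // compare phi against this
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private boolean up = true;

    NaivePhiDetector(double convictThreshold)
    {
        this.convictThreshold = convictThreshold;
    }

    // Called whenever a heartbeat (directly or via gossip) is observed.
    void report(long nowMillis)
    {
        if (lastHeartbeatMillis >= 0)
        {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > WINDOW)
                intervalsMillis.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
        up = true;                                  // any heartbeat => instantly "up" again
    }

    // Called on every gossip round (roughly once per second).
    boolean interpret(long nowMillis)
    {
        if (lastHeartbeatMillis < 0 || intervalsMillis.isEmpty())
            return up;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double phi = (nowMillis - lastHeartbeatMillis) / mean;  // constant scale factor folded into the threshold
        if (phi > convictThreshold)
            up = false;                             // binary conviction; the score itself isn't used elsewhere
        return up;
    }
}
{code}

The essential points: interpret() is basically a timeout check scaled by recent history, and report() flips the node straight back to up.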

Given the above, and general background, it seems to me that:

* In terms of detecting that something goes down, the FD is barely one step above slapping a fixed timeout on heartbeats; essentially a timeout scaled relative to the average historic heartbeat interval.
** But on the other hand, it's also fuzzed (relative to simple TCP timeouts) due to variation in gossip propagation time, which is almost certainly higher than the variation in network latency between two nodes.
* The gist of the previous two items is that the FD is really not doing anything advanced/magic or otherwise "opaque".
* In addition, because any heartbeat from a downed node implies that it goes up immediately, the failure detector has very little ability to do something "non-trivial" to deal with partial failures, such as demanding that a flapping node show itself healthy for a while before going back up.
** Indeed, as far as I can tell, if a node's heartbeats are slowly growing worse it might never get marked as down - if the rate of worsening is slow enough, the past interval history just scales up along with it and the threshold is never hit. (Untested/unconfirmed; see the toy simulation after this list.)
* The FD is oblivious to input from real traffic (2,000,000 messages backed up? doesn't matter, it's "up", even though neighboring nodes have nothing backed up). This is not necessarily wrong in any way, but it needs to be kept in mind when considering what to do *with* the FD.
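
Here is the toy simulation referred to in the list above (again, this says nothing about the real implementation; the 0.1% per-heartbeat slowdown, the 1000-interval window, and checking phi at the worst possible moment - right before the next heartbeat lands - are all arbitrary assumptions):

{code:java}
// Toy simulation of the "slowly growing worse" scenario: every heartbeat interval
// is 0.1% longer than the previous one, and we track the worst phi ratio ever seen.
import java.util.ArrayDeque;
import java.util.Deque;

public class SlowDegradationDemo
{
    public static void main(String[] args)
    {
        Deque<Double> window = new ArrayDeque<>();
        double interval = 1000.0;                   // start at ~1s between heartbeats
        for (int i = 0; i < 1000; i++)              // seed a stable history
            window.addLast(interval);

        double maxPhi = 0.0;
        for (int step = 0; step < 20000; step++)
        {
            double next = interval * 1.001;         // heartbeats get 0.1% slower each time
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(1.0);
            maxPhi = Math.max(maxPhi, next / mean); // worst case: checked just before the next heartbeat
            window.addLast(next);
            if (window.size() > 1000)
                window.removeFirst();
            interval = next;
        }
        System.out.printf("final interval = %.3g ms, max phi ratio seen = %.2f%n", interval, maxPhi);
    }
}
{code}

In this run the ratio tops out around 1.6 even though the intervals have grown by many orders of magnitude by the end - nowhere near where a conviction threshold would plausibly be set.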

Now, CASSANDRA-3910 was recently filed, where there is an attempt to use the FD for something I personally think is better dealt with in the request path (see that ticket).

It seems to me that the FD as it works now is *definitely* not suitable for handling partial failures or smoothly redirecting traffic anywhere, since we are converting the output of the underlying FD algorithm to a binary up/down state. Further, even if we directly propagated current phi and used it to do relative weighting when sending requests, we still have the instant return to low phi on the very next heartbeat. As currently implemented, it is just not suitable for anything other than binary up/down flagging, as far as I can tell.

This reinforces, in my view, my skepticism towards CASSANDRA-2433, and my position in CASSANDRA-3294. We seem to be treating the failure detector as if it's doing something non-trivial, when in reality it isn't (gossip as a communication channel may be non-trivial, but the FD algorithm isn't). It seems to me the main function of the failure detector is that it allows us to scale to very large clusters; we do not need full-mesh ping-pong (whether at the app level or the socket level) in order for up/down state to be communicated. This is a useful feature. However, it is actually a result of using gossip as the underlying channel, rather than of anything specific to the FD algorithm (except insofar as the FD algorithm doesn't need more detailed or specific information than is reasonable to communicate over gossip).

I believe that the FD *is* good at:

* Efficiently (in terms of scaling to large clusters) allowing controlled shutdowns and "official" up/down marking.
** Useful for e.g. selecting hosts for streaming - but only for its usefulness as "official flag" marking, not in the sense that it detects failure conditions.
** Nodes coming up from scratch can get a feel for the cluster immediately without having to go through an initial burst of failed messages (though currently we don't actually guarantee this anyway, because we don't wait for gossip to settle on start-up - but that's a separate issue).

I believe that the FD is *not* good at:

* Indicating whether to send a message to a node, or how often, except for the special case of "the node is completely down, don't even bother at all, ever".
* Optimizing for request latency, at all. It is not an effective tool to mitigate request latencies.
* Optimizing for avoiding heap growth as requests back up; this is a very separate concern that should take into account things like relative queue sizes, past history of real requests, etc.
* High latencies, node hiccups, congestion.

I think the things it is good at are a legitimate job for the FD. The things I list under "not good at" are, I think, the job of other efforts like CASSANDRA-2540 and CASSANDRA-3294 that add intelligence to the way we handle requests. Even an infinitely smart FD will never work for these tasks, as long as we retain the binary up/down output (unless you want to start talking about flapping up/down at high frequency). I also think that with a sufficiently good request path, we probably don't even need any active ping-pong at all, except maybe very seldom if we don't send any traffic to the node. So while using gossip for this is a cool application of it, I am unconvinced that we need active ping-pong to any great extent if we are sufficiently good at not routing requests to nodes of unknown status, and instantly prefer nodes that are actively responding over those that aren't (e.g. least-pending-requests input to routing).

Finally, there is one additional, particularly interesting feature of the failure detector and its integration with gossip, as it works now, that I am aware of and feel is worth mentioning: my understanding is that the intention is for the FD, due to its coupling with gossip, to guarantee that we never send messages to a node whose impact on the token ring is somehow "incorrect". So for example, you're supposed to be able to make topology changes during a cluster partition and it will just magically resolve itself when the partition heals. (I would *personally* never trust this in production; others may disagree. But theoretically, if it works, it's a cool feature.)

Thoughts?


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

Posted by "Sylvain Lebresne (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210185#comment-13210185 ] 

Sylvain Lebresne commented on CASSANDRA-3927:
---------------------------------------------

bq. Thoughts?

Apologies for being rude, but JIRA is for bug and feature/improvement tracking. What is the concrete proposal of this ticket?

Also, I don't think anyone mystifies the FD (I certainly don't), nor disputes the fact that latency could and should be optimized. Please don't assume we're stupid, but rather recall that 1) we don't have infinite time to do every possible improvement, though we certainly appreciate help, 2) while the FD may not be the perfect solution for every task we use it for, I'm not under the impression that it's everyone's number 1 problem in using Cassandra (so replacing it, even if only for some usage, hasn't been our top priority so far) and 3) we are already aware of those things that can be improved, since almost all the "propositions" you listed either have a ticket or have at least been discussed in some ticket.

So yes, I agree with you that failure detection could be improved and that some of those improvements probably involve using different means than the FD for some tasks. Do feel free to propose patches.

But some people (at least me) try to follow what's going on in JIRA and that kind of very long ticket is just making it harder. It's honestly not the first ticket where I feel that you're writing everything that pops into your mind and imho a good chunk of it could be interesting in a forum but is not JIRA material.

Hoping you won't take this the wrong way.

                

[jira] [Commented] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210148#comment-13210148 ] 

Peter Schuller commented on CASSANDRA-3927:
-------------------------------------------

Ways I can remember/think of right now that I believe are along the lines of something to do in the request path (some of it covered by tickets, some of it not; might hunt them down later; a rough sketch of the first couple of items follows the list):

* Consider number of outstanding requests to a node (least outstanding, or a soft variation thereof).
* Make the output of the dynamic snitch probabilistic instead of discrete; a node should not instantly and suddenly flap over in a replica set, but rather a slow host should be appropriately de-prioritized.
** Avoids sudden flapping behavior.
** Avoids e.g. having certain nodes be completely cold for a replica set due to snitching, followed by suddenly taking full traffic and causing poor latencies/timeouts as a result of no cache locality (happens easily with no or very low read repair).
* Data reads to all nodes (or at least multiple, maybe tunable).
* Even when normally not doing data reads to all for efficiency, do so speculatively (as Stu has mentioned in one of the other tickets).
* Be more restrictive about queueing tasks and requests; bound GC/memory footprint impact.
** For example, with RF=3 and a QUORUM request, if one replica is completely backed up - let's not even send a request to that node, but send reads to the other two. This prioritizes requests that really *need* a response from a node.
** For example, put caps on outstanding data beyond that implied by rpc timeout and request throughput.
** For example, spilling over to non-heap might be doable. Think of an off-heap (maybe persistent on disk/SSD) queue.
** Be more aggressive about dropping incoming requests.
** Never ever ever rack up hundreds of thousands or millions of tasks internally in front of a stage.
* Immediately react to the fact that a TCP connection (our only communication mechanism) got dropped (with appropriate adjustments to try to keep TCP connections up regardless of message flow).
* Support "give me data if smaller than X, only digest if not".

                

[jira] [Commented] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210779#comment-13210779 ] 

Peter Schuller commented on CASSANDRA-3927:
-------------------------------------------

{quote}
Apologies for being rude, but JIRA is for bug and feature/improvement tracking. What is the concrete proposal of this ticket?
{quote}

Please do speak your mind, I much prefer honest criticism over holding back and being annoyed. My tendency to use JIRA is in part because my past experience with the mailing list archive has been very poor (e.g. not working in Chrome at all, weird javascripty stuff, no idea if links are permanent even if they work), and I want something linkable and persistent.

{quote}
Please don't assume we're stupid
{quote}

I am not, and I'm sorry if it came across that way. Is everything here old news to absolutely everyone? Because the fact is that in various talks, discussions, blog posts etc, I have only ever, as far as I can recall, heard people refer to the AFD paper and/or the accrual failure detector in very vague terms, and in a way which, at least to me, implies a sense of non-triviality and that there is something more significant going on there. I've never, as far as I can recall, heard/seen anyone just flat out say "it looks at the ratio between the time since the last heartbeat and the average over the past N heartbeats, and thresholds it" or anything along those lines.

I only incidentally realized what it was doing as a result of having to look at the code carefully for other reasons, and I actually spent extra time double-checking the code just because I thought I was missing something since I had gotten the impression it was doing more than just that.

Other than the overall "people mention the AFD all the time" bit, I have just gotten the general feeling that the behavior of the FD is a bit opaque. (E.g. look at the work Paul had to do for CASSANDRA-2597. And why is there an enforced maximum convict threshold? Would anyone ever do that on a timeout if it was obvious it was "just" a timeout (more or less)? And how come we have all this theory about determining when something is down in some kind of statistically smart fashion, yet just poke a box back up when a heartbeat is seen? It doesn't feel consistent. And yes, I could suggest patches for those, but what would my motivation be for "remove the limitation on phi" without something like this ticket as necessary background?)

All in all, I legitimately believed (and still believe, even though I'm sure you specifically know the FD very well, and I'm talking generally) that there are plenty of people who don't know how it works - I don't think I was alone. And even without being a developer, just operating a cluster, that one-sentence description of how the FD works is insanely useful in production when trying to understand what's happening and being able to predict the behavior of a cluster under hypothesized conditions - yet it is never really communicated (or maybe I'm just missing it).

I suppose my description here is at least *right* then, and it would be okay to e.g. edit the wiki to explain this behavior rather than just link to the AFD paper?

{quote}
So yes, I agree with you that failure detection could be improved and that some of those improvements probably involve using different means than the FD for some tasks. Do feel free to propose patches.
{quote}

I do hope to start submitting patches that address latency, as always depending on time, and I'm not suggesting anyone else should do a bunch of work. But I feel it helps to figure out whether there is some kind of agreement on how things work, what the issues are, etc. - i.e., how individual patches can fit into the bigger picture. A common difficulty in software development, in my experience, is attacking problems in isolation and not seeing how everything fits together and what the real-world effects are of how individual components behave. Promoting understanding of how things work seems productive for this reason.

{quote}
But some people (at least me) try to follow what's going on in JIRA and that kind of very long ticket is just making it harder.
{quote}

I'm not sure how the length of the ticket is a concern, but I certainly see how a bunch of open tickets is a problem. I'll definitely close this one. And yes, I am also very much trying to follow JIRA. I apologize if I made it harder, though.

{quote}
 It's honestly not the first ticket where I feel that you're writing everything that pops into your mind and imho a good chunk of it could be interesting in a forum but is not JIRA material.
{quote}

I do feel I have to object to "everything that pops into your mind". I spend a lot of time thinking about things, how they fit together, and what the bigger picture is. I had been meaning to post something on this for a long time (I had most of it written sitting around in a text file). I held back because I don't currently expect to start submitting patches to fix things. But then CASSANDRA-3910 was filed, which made me want to complete this and post it.

I know I tend to write long messages, but I'm not just writing random stuff that pops into my head. If it comes across like that, it's because I am not being successful in expressing what I want to express. But I am not just randomly coming up with something and starting to type.



                

[jira] [Resolved] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

Posted by "Peter Schuller (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Schuller resolved CASSANDRA-3927.
---------------------------------------

    Resolution: Invalid
    

[jira] [Commented] (CASSANDRA-3927) demystify failure detector, consider partial failure handling, latency optimizations

Posted by "Jonathan Ellis (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13210815#comment-13210815 ] 

Jonathan Ellis commented on CASSANDRA-3927:
-------------------------------------------

Granted, the ML archive options we have are pretty awful.  If creating a linkable record is your goal, perhaps some "here is my thinking through of X" trains of thought are best suited to a blog post?  Creating JIRA tickets is a Call To Action, and it's exhausting to parse through 1300+ words looking for what action is actually being proposed, only to come up empty-handed or with an exercise-for-the-reader.

On a related note, I personally very much appreciate writing that is structured as an abstract/summary followed by paragraphs, each of which is introduced by a topic sentence.  This lets me as a reader decide where I need to dig in and where it's okay to skim.  (Newspaper articles are invariably written this way, incidentally, if I may invoke that soon-to-be-obsolete medium.)

Admittedly, writing this way takes more time, sometimes a lot more time, but it's worth it when communicating in a group setting like this, where many people are trying to keep up with a large volume of tickets and proposals.

                
> demystify failure detector, consider partial failure handling, latency optimizations
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-3927
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3927
>             Project: Cassandra
>          Issue Type: Wish
>            Reporter: Peter Schuller
>            Assignee: Peter Schuller
>            Priority: Minor
>
> [My aim with this ticket is to explain my current understanding of the FD and it's behavior, express some opinions, and invite others to let me know if I'm misunderstanding something.]
> So I was getting back to CASSANDRA-3569 and I re-read the ticket history, and I want to add a few things that are more about the FD in general, that I've realized since the last batch of discussion.
> Firstly, as a result of investigating gossip more, and of reading CASSANDRA-2597 (Paul Cannon's excellent break-down of what the FD actually does mathematically - thank you Paul!), I now have a much better understanding of the behavior of the failure detector than I did before. Unfortunately for the failure detector. Under the premise that the break-down in CASSANDRA-2597 (and the resulting commit to the behavior of Cassandra) is correct, if we ignore all the guassian/normal distribution stuff (I openly admit I lack the necessary math background for a rigorous analysis), the behavior of the failure detector is the following (not a quote despite use of quote syntax, I'm speaking):
> {quote}
> For a given node, keep track of the last 1000 intervals between heartbeats received via gossip (indirectly or directly from the node, doesn't matter). At any given moment, the phi "score" of the node is the *time since the last heartbeat divided by the average time between heartbeats over the past 1000 intervals* (scaled by some constant factor which is essentially ignoreable, since it is equivalent of the equivalent adjustment to convict threshold). If it goes above a certain value, we consider it down. We check for this on every gossip round (meaning about every second).
> {quote}
> I want to re-state the behavior like this because it makes it trivial to get an intuitive understanding of what it does without dwelwing into the AFD paper.
> In addition, consider that we the failure detector will *immediately* consider a node up when it receives a new heartbeat and if the node is considered down at the time.
> Further, while the accural FD paper talks about giving a non-binary score, and we do this, we don't actually use it except to trigger a binary up/down flap.
> Given the above, and general background, it seems to me that:
> * In terms of detecting that something goes down, the FD is just barely one step above just slapping a fixed timeout on heartbeats; essentially a timeout scaled relative to average historic latency.
> ** But on the other hand, it's also fuzzed (relative to simple tcp timeouts) due to o variation in gossip propagation time which is almost certainly higher than the variation in network latency between two nodes.
> * The gist of the previous two items is that the FD is really not doing anything advanced/magic or otherwise "opaque".
> * In addition, because any heartbeat from a down:ed node implies that it goes up immediately, the failure detector has very little ability to effectively do something "non-trivial" to deal with partial failures, such as demanding that a flapping node show itself healthy for a while before going back up.
> ** Indeed, as far as I can tell, if a node's heartbeat intervals slowly grow worse, it might never get marked as down - if the rate of worsening is slow enough, you just slowly inflate the historical interval average and never hit the threshold. (Untested/unconfirmed; see the sketch after this list.)
> * The FD is oblivious to input from real traffic (2,000,000 messages backed up? doesn't matter, it's "up", even though neighboring nodes have nothing backed up). This is not necessarily wrong in any way, but it needs to be kept in mind when considering what to do *with* the FD.
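> To make the points in this list concrete, here is a rough, untested sketch of the simplified phi calculation as I described it above. This is hypothetical code, not Cassandra's actual FailureDetector, and the constant scaling factor is deliberately ignored. If you run it against a node whose heartbeat interval degrades by 0.1% per beat, the intervals end up roughly 150x longer than they started, yet phi never crosses the stand-in threshold, because the windowed average keeps pace with the degradation:
> {code}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> // Hypothetical sketch only -- NOT Cassandra's FailureDetector. It implements the
> // simplified description above: phi ~ (time since last heartbeat) divided by the
> // mean of the last 1000 heartbeat intervals, with the scaling constant ignored.
> public class PhiSketch
> {
>     private static final int WINDOW = 1000;
>     private final Deque<Double> intervals = new ArrayDeque<Double>();
>     private double lastHeartbeat = Double.NaN;
> 
>     void report(double nowMillis)
>     {
>         if (!Double.isNaN(lastHeartbeat))
>         {
>             intervals.addLast(nowMillis - lastHeartbeat);
>             if (intervals.size() > WINDOW)
>                 intervals.removeFirst();
>         }
>         lastHeartbeat = nowMillis; // any heartbeat instantly "revives" the node
>     }
> 
>     double phi(double nowMillis)
>     {
>         if (intervals.isEmpty())
>             return 0;
>         double sum = 0;
>         for (double interval : intervals)
>             sum += interval;
>         double mean = sum / intervals.size();
>         return (nowMillis - lastHeartbeat) / mean;
>     }
> 
>     public static void main(String[] args)
>     {
>         PhiSketch fd = new PhiSketch();
>         double threshold = 8;          // stand-in for a convict threshold
>         double t = 0, interval = 1000; // heartbeats start ~1 second apart
>         for (int beat = 0; beat < 5000; beat++)
>         {
>             fd.report(t);
>             interval *= 1.001;         // the node degrades by 0.1% per heartbeat
>             t += interval;
>             if (fd.phi(t) > threshold) // checked just before the next heartbeat
>                 System.out.println("convicted at heartbeat " + beat);
>         }
>         // Nothing is ever printed: the windowed mean tracks the slow degradation,
>         // so phi stays well below the threshold even as intervals balloon.
>     }
> }
> {code}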
> Now, CASSANDRA-3910 was recently filed, where there is an attempt to use the FD for something I personally think is better dealt with in the request path (see that ticket).
> It seems to me that the FD as it works now is *definitely* not suitable for handling partial failures or for smoothly redirecting traffic, since we are converting the output of the underlying FD algorithm to a binary up/down state. Further, even if we directly propagated the current phi and used it to do relative weighting when sending requests, we would still have the instant return to low phi on the very next heartbeat. It is just not suitable, as currently implemented, for anything other than binary up/down flagging as far as I can tell.
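> (Purely as a hypothetical illustration of that last point: a relative weighting such as the toy function below would indeed route less traffic to a high-phi node, but because a single heartbeat snaps phi back toward zero, the weight recovers to full instantly, no matter how unhealthy the node was a moment earlier.)
> {code}
> // Hypothetical only; the shape of the function is made up. Any weighting derived
> // from phi is defeated by the instant reset of phi on the next heartbeat.
> static double replicaWeight(double phi, double convictThreshold)
> {
>     return Math.max(0.0, 1.0 - phi / convictThreshold);
> }
> {code}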
> This reinforces, in my view, my skepticism towards CASSANDRA-2433, and my position in CASSANDRA-3294. We seem to be treating the failure detector as if it's doing something non-trivial, when in reality it isn't (gossip as a communication channel may be non-trivial, but the FD algorithm isn't). It seems to me the main function of the failure detector is that it allows us to scale to very large clusters; we do not need full-mesh ping-pong (whether at app level or at socket level) in order for up/down state to be communicated. This is a useful feature. However, it is actually a result of using gossip as the underlying channel, rather than due to anything specific to the FD algorithm (except insofar as the FD algorithm doesn't need more detailed or specific information than is reasonable to communicate over gossip).
> I believe that the FD *is* good at:
> * Efficiently (in terms of scaling to large clusters) allowing controlled shutdowns and "official" up/down marking.
> ** Useful for e.g. selecting hosts for streaming - but only for its usefulness as "official flag" marking, not in the sense that it detects failure conditions.
> ** Nodes coming up from scratch can get a feel for the cluster immediately without having to go through an initial burst of failed messages (but currently we don't actually guarantee this anyway because we don't wait for gossip to settle on start-up - but that's a separate issue).
> I believe that the FD is *not* good at:
> * Indicating whether to send a message to a node, or how often, except for the special case of "the node is completely down, don't even bother at all, ever".
> * Optimizing for request latency, at all. It is not an effective tool to mitigate request latencies.
> * Optimizing for avoiding heap growth as requests back up; this is a very separate concern that should take into account things like relative queue sizes, past history of real requests, etc.
> * High latencies, node hiccups, congestion.
> I think that the things it is good at are a legitimate job for the FD. The things I list under bad are, I think, the job of other efforts like CASSANDRA-2540 and CASSANDRA-3294 that add intelligence to the way we handle requests. Even an infinitely smart FD will never work for these tasks as long as we retain the binary up/down output (unless you want to start talking about flapping up/down at high frequency). I also think that with a sufficiently good request path, we probably don't even need any active ping-pong at all, except maybe very seldom if we don't send any traffic to the node. So while using gossip for this is a cool application of it, I am unconvinced that we need active ping-pong to any great extent if we are sufficiently good at not routing requests to nodes of unknown status, and instantly prefer nodes that are actively responding over those that don't (e.g. feeding least-pending-request counts into routing; a rough sketch follows below).
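> (The sketch below is hypothetical and untested - the names are made up and this is not Cassandra's snitch or routing code - but it illustrates the kind of least-pending-request input I mean: a node that stops responding accumulates in-flight requests and is naturally avoided by live traffic, with no explicit up/down conviction required.)
> {code}
> import java.util.Comparator;
> import java.util.List;
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicInteger;
> 
> // Hypothetical sketch of "least pending requests" replica selection: live traffic,
> // not the failure detector, decides where the next request goes.
> class LeastPendingRouter
> {
>     private final Map<String, AtomicInteger> pending = new ConcurrentHashMap<String, AtomicInteger>();
> 
>     // Prefer the replica with the fewest requests currently in flight; a node that
>     // stops responding accumulates pending requests and is naturally deprioritized.
>     String pick(List<String> replicas)
>     {
>         return replicas.stream()
>                        .min((a, b) -> Integer.compare(inflight(a).get(), inflight(b).get()))
>                        .orElseThrow(IllegalArgumentException::new);
>     }
> 
>     void onSend(String replica)  { inflight(replica).incrementAndGet(); }
>     void onReply(String replica) { inflight(replica).decrementAndGet(); }
> 
>     private AtomicInteger inflight(String replica)
>     {
>         return pending.computeIfAbsent(replica, r -> new AtomicInteger());
>     }
> }
> {code}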
> Finally, there is one additional, particularly interesting feature of the failure detector and its current integration with gossip that I am aware of and feel is worth mentioning: my understanding is that the intention is for the FD, due to its coupling with gossip, to guarantee that we never send messages to a node whose impact on the token ring is somehow "incorrect". So for example, on a cluster partition you're supposed to be able to make topological changes during the partition, and they will just magically resolve themselves when you de-partition. (I would *personally* never ever trust this in production; others may disagree. But theoretically, if it works, it's a cool feature.)
> Thoughts?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira