Posted to commits@cassandra.apache.org by "Peter Schuller (Created) (JIRA)" <ji...@apache.org> on 2012/02/02 00:43:53 UTC

[jira] [Created] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

gossip-to-seeds is not obviously independent of failure detection algorithm 
----------------------------------------------------------------------------

                 Key: CASSANDRA-3830
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3830
             Project: Cassandra
          Issue Type: Task
          Components: Core
            Reporter: Peter Schuller
            Priority: Minor


The failure detector, ignoring all the theory, boils down to an
extremely simple algorithm. The FD keeps track of a sliding window
(currently 1000 entries) of heartbeat intervals for a given host.
Meaning, we have a record of the intervals between the last 1000 times
we saw an updated heartbeat for that host.

At any given moment, a host has a score which is simply the time since
the last heartbeat, over the *mean* interval in the sliding
window. For historical reasons a simple scaling factor is applied to
this prior to checking the phi conviction threshold.

(CASSANDRA-2597 has the details, but thanks to Paul's work there it is
now trivial to understand what it does intuitively.)

So in effect, a host is considered down if we haven't heard from it in
some time which is significantly longer than the "average" time we
expect to hear from it.
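
As a rough illustration (this is *not* the actual FailureDetector
code; the class name and the scaling constant are made up for the
sketch), the simplified scoring amounts to something like:

{code}
// Illustrative sketch only, not o.a.c.gms.FailureDetector.
import java.util.ArrayDeque;
import java.util.Deque;

class HeartbeatWindowSketch
{
    private static final int SAMPLE_SIZE = 1000;              // sliding window of intervals
    private static final double SCALE = 1.0 / Math.log(10.0); // illustrative scaling factor

    private final Deque<Long> intervals = new ArrayDeque<Long>();
    private long lastHeartbeatMillis = 0;

    // Record a newly observed heartbeat update for this host.
    void report(long nowMillis)
    {
        if (lastHeartbeatMillis != 0)
        {
            if (intervals.size() == SAMPLE_SIZE)
                intervals.removeFirst();
            intervals.addLast(nowMillis - lastHeartbeatMillis);
        }
        lastHeartbeatMillis = nowMillis;
    }

    // Score = (time since last heartbeat / mean interval), scaled; the
    // result is compared against the phi conviction threshold elsewhere.
    double phi(long nowMillis)
    {
        if (intervals.isEmpty())
            return 0.0;
        double sum = 0;
        for (long interval : intervals)
            sum += interval;
        double mean = sum / intervals.size();
        return SCALE * (nowMillis - lastHeartbeatMillis) / mean;
    }
}
{code}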

This seems reasonable, but it does assume that under normal conditions
the average time between heartbeats does not change for reasons other
than those that would be plausible reasons to think a node is
unhealthy.

This assumption *could* be violated by the gossip-to-seed
feature. There is an argument to avoid gossip-to-seed for other
reasons (see CASSANDRA-3829), but this is a concrete case in which
gossip-to-seed could cause a negative side-effect of the general kind
mentioned in CASSANDRA-3829 (see the notes at the end about the case
w/o seeds not being continuously tested). Normally, due to gossip to
seed, everyone essentially sees the latest information within very few
heartbeats (assuming only 2-3 seeds). But should all seeds be down, we
suddenly flip a switch and start relying on generalized propagation in
the gossip system, rather than the seed special case.

The potential problem I foresee here is that if the average propagation
time suddenly spikes when all seeds become unavailable, it could cause
bogus flapping of nodes into the down state.

In order to test this, I deployed a ~180 node cluster with a version
that logs heartbeat information on each interpret() call, similar to:

 INFO [GossipTasks:1] 2012-02-01 23:29:58,746 FailureDetector.java (line 187) ep /XXX.XXX.XXX.XXX is at phi 0.0019521638443084342, last interval 7.0, mean is 1557.2777777777778

It turns out that, at least at 180 nodes, with 4 seed nodes, whether
or not seeds are running *does not* seem to matter significantly. In
both cases, the mean interval is around 1500 milliseconds.

I don't feel I have a good grasp of whether this is incidental or
guaranteed, and it would be good to at least empirically test
propagation time w/o seeds at different cluster sizes; it's supposed
to be unaffected by cluster size ({{RING_DELAY}} is static for this
reason, as I understand it). It would be nice to confirm that this is
the case.


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207176#comment-13207176 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

Correct, and the concern is that when the optimization is "removed"
(e.g., by seeds being down), the failure detector may be affected if
the average heartbeat interval changes as a result.
                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198466#comment-13198466 ] 

Brandon Williams commented on CASSANDRA-3830:
---------------------------------------------

bq. This assumption could be violated by the gossip-to-seed feature.

I don't understand; the only time we explicitly gossip to a seed is when the number of live endpoints is less than the number of defined seeds.
                

[jira] [Assigned] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Schuller reassigned CASSANDRA-3830:
-----------------------------------------

    Assignee: Peter Schuller
    

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206857#comment-13206857 ] 

Brandon Williams commented on CASSANDRA-3830:
---------------------------------------------

bq. What you describe is not the behavior of the Gossiper. It picks a random node to gossip to. Then, unless the node happened to also be a seed node, it picks a random seed node to gossip to as well.

Right.

bq. The "less than number of seeds" you're mentioning

What I meant to say is this is the only special-case for seeds; gossiping to at least one seed every round is the normal case, as you said.
                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13198447#comment-13198447 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

I would expect to see an actual difference with only a single seed vs. no seeds. Not tested yet; I will try to.
                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207060#comment-13207060 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

{quote}
What I meant to say is this is the only special-case for seeds; gossiping to at least one seed every round is the normal case, as you said.
{quote}

Ah. So what I mean by the seed special case is the fact that we have
the gossip-to-seed logic at all. One of the core aspects of gossip is
propagation delay, and whether and to what extent it is affected by
things like cluster size. My concern is that all production clusters
that follow the recommendation w.r.t. seeds are potentially working
well only because we are gossiping to seeds. It's trivial to see that
if we have N servers all gossiping to a small set of 2-4 servers,
propagation delay is not going to be a major problem as long as at
least one of those is up.

Anyway, I'll try to get around to graphing average propagation delay
as a function of cluster size (along with p99s or something) and see
whether there is a correlation.
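
Something like the following toy simulation is the kind of thing I
have in mind (purely illustrative; a crude synchronous push-pull model
with made-up parameters, not a measurement of the real Gossiper):

{code}
// Toy model of rumor spread: each round every node exchanges state with
// one random peer, and optionally with one random seed as well (nodes
// 0..seedCount-1 play the seed role). Counts rounds until a rumor
// started at one node has reached everyone. Purely illustrative.
import java.util.Random;

public class PropagationToy
{
    public static void main(String[] args)
    {
        int n = 180;
        int seedCount = 4;
        Random rnd = new Random();

        for (boolean useSeeds : new boolean[]{ true, false })
        {
            boolean[] knows = new boolean[n];
            knows[n - 1] = true; // rumor starts at a non-seed node
            int rounds = 0;
            while (!allKnow(knows))
            {
                boolean[] next = knows.clone();
                for (int i = 0; i < n; i++)
                {
                    int j = rnd.nextInt(n); // random peer, push-pull exchange
                    if (knows[i] || knows[j])
                    {
                        next[i] = true;
                        next[j] = true;
                    }
                    if (useSeeds)
                    {
                        int s = rnd.nextInt(seedCount); // also exchange with a seed
                        if (knows[i] || knows[s])
                        {
                            next[i] = true;
                            next[s] = true;
                        }
                    }
                }
                knows = next;
                rounds++;
            }
            System.out.println((useSeeds ? "with seeds: " : "without seeds: ") + rounds + " rounds");
        }
    }

    static boolean allKnow(boolean[] knows)
    {
        for (boolean k : knows)
            if (!k)
                return false;
        return true;
    }
}
{code}

The real test will obviously be about measured heartbeat intervals on
a live cluster rather than synchronous rounds, but this gives a
baseline for round counts with and without the seed fan-in.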
                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207061#comment-13207061 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

To clarify, the relation to the failure detector isn't the absolute
propagation delay; I am concerned with a sudden *change* in propagation
delay (either average or outliers).
                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Peter Schuller (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13206745#comment-13206745 ] 

Peter Schuller commented on CASSANDRA-3830:
-------------------------------------------

I had written a response here, but I assume I must have failed to submit it and lost track of the browser tab or something.

What you describe is not the behavior of the Gossiper. It picks a random node to gossip to. Then, unless the node *happened* to also be a seed node, it picks a random *seed node* to gossip to *as well*.

The "less than number of seeds" you're mentioning is presumably due to the comments in the code before the gossip to seed:

{code}
                    /* Gossip to a seed if we did not do so above, or we have seen less nodes
                       than there are seeds.  This prevents partitions where each group of nodes
                       is only gossiping to a subset of the seeds.

                       The most straightforward check would be to check that all the seeds have been
                       verified either as live or unreachable.  To avoid that computation each round,
                       we reason that:

                       either all the live nodes are seeds, in which case non-seeds that come online
                       will introduce themselves to a member of the ring by definition,

                       or there is at least one non-seed node in the list, in which case eventually
                       someone will gossip to it, and then do a gossip to a random seed from the
                       gossipedToSeed check.

                       See CASSANDRA-150 for more exposition. */
                    if (!gossipedToSeed || liveEndpoints.size() < seeds.size())
                        doGossipToSeed(prod);
{code}

If you look carefully though, you'll see that the number of live
endpoints is *only* relevant in the sense that it forces *always*
gossiping to a seed even if we already did. In the normal case (i.e.,
almost always), we have more live endpoints than seeds, and we still
gossip to a seed because of {{!gossipedToSeed}}.
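
To make the flow concrete, here is a stripped-down sketch of the
per-round decision (simplified and with made-up names/structure, not
the real Gossiper):

{code}
// Simplified sketch of one gossip round as discussed above; not the
// actual o.a.c.gms.Gossiper code.
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

class GossipRoundSketch
{
    final Set<String> liveEndpoints = new HashSet<String>();
    final Set<String> seeds = new HashSet<String>();
    final Random random = new Random();

    void runRound()
    {
        // 1. Gossip to one random live endpoint.
        String target = randomFrom(liveEndpoints);
        boolean gossipedToSeed = target != null && seeds.contains(target);
        if (target != null)
            send(target);

        // 2. Unless that endpoint already was a seed (or we see fewer
        //    live nodes than seeds), also gossip to a random seed.
        if (!gossipedToSeed || liveEndpoints.size() < seeds.size())
        {
            String seed = randomFrom(seeds);
            if (seed != null)
                send(seed);
        }
    }

    String randomFrom(Set<String> set)
    {
        if (set.isEmpty())
            return null;
        int i = random.nextInt(set.size());
        for (String s : set)
            if (i-- == 0)
                return s;
        return null;
    }

    void send(String endpoint)
    {
        // A real implementation would build and send a GossipDigestSyn here.
    }
}
{code}

So in the common case, every node still fans in to a seed every round,
regardless of cluster size.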


                

[jira] [Commented] (CASSANDRA-3830) gossip-to-seeds is not obviously independent of failure detection algorithm

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-3830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207110#comment-13207110 ] 

Brandon Williams commented on CASSANDRA-3830:
---------------------------------------------

CASSANDRA-617 may be of interest then (though this is from when gossip
was old and busted; UDP and whatnot).

bq. It's trivial to see that if we have a bunch of N servers all gossiping to a small set of 2-4 servers, propagation delay is not going to be a major problem as long as at least one of those are up

Right, gossiping to a seed every round actually becomes a bit of an optimization in this regard, but isn't strictly necessary.
                