Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2009/05/07 18:13:30 UTC

[jira] Created: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

multiple seeds (only when seed count = node count?) can cause cluster partition
-------------------------------------------------------------------------------

                 Key: CASSANDRA-150
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-150
             Project: Cassandra
          Issue Type: Bug
            Reporter: Jonathan Ellis
             Fix For: 0.4


Happens fairly frequently on my test cluster of 5 nodes. (I normally restart all nodes at once when updating the code; I haven't tested restarting one machine at a time.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-150:
-------------------------------------

    Fix Version/s: 0.5

> multiple seeds (only when seed count = node count?) can cause cluster partition
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-150
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-150
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Jaakko Laine
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: 150.patch
>
>
> happens fairly frequently on my test cluster of 5 nodes.  (i normally restart all nodes at once when updating the code.  haven't tested w/ restarting one machine at a time.)



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782713#action_12782713 ] 

Jaakko Laine commented on CASSANDRA-150:
----------------------------------------

This kind of partition cannot happen if there are fewer than four seeds, or if there is even a single non-seed node: there must be enough seeds to form at least two separate network closures of at least two seeds each. If there has been a problem with a cluster of 3 seeds out of 4 nodes, it must be a different issue, as two of these preconditions are not met.

Gossip rule #1 sends gossip to a live node, and rule #3 sends to a random seed if the node chosen in #1 was not a seed. If there is even a single non-seed node, it will trigger gossip to a random seed every time gossip is sent to it; eventually this breaks the network closures. What the patch basically does is aggressively search for seeds until the node has found at least as many nodes as there are seeds. It does not matter if this set does not include all seeds: in that case there are non-seeds in liveEndpoints, and gossip sent to a non-seed triggers gossip to a random seed. So this is just to help the Gossiper get started, not to find all seeds; whether it finds all seeds or at least one non-seed, it can continue from there.

Now of course the "correct" checks for this condition would be to verify, on each gossip round, (1) whether liveEndpoints and unreachableEndpoints together include all seeds, or (2) whether liveEndpoints includes at least one non-seed. However, putting these checks on the normal execution path only for the sake of one special case does not appeal to me, so I decided to add this simple check instead.

Now that I think of it, there is one extremely special case that could still cause a partition: a cluster of 4 seeds and 2 non-seeds. First 2 seeds and 2 non-seeds come online; everybody is happy, as the number of nodes online equals the number of seeds. Now both of those seeds go down, and then the other two seeds come up. Again everybody is happy. Now suppose the two non-seeds go down, and after that the two original seeds come up simultaneously and happen to choose each other from the list of random seeds. In this case each seed will send gossip only to the other seed, as it has 2 nodes in unreachableEndpoints, which makes the total number of seen nodes equal the number of seeds. To avoid this, we might relax the condition a bit and send gossip to a seed if the number of liveEndpoints is less than the number of seeds (that is, ignore unreachableEndpoints). This modification would take care of the scenario above, but I don't know if it is worth the trouble: if either of the non-seeds recovers (or one of the seeds goes down), the deadlock is broken.
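The relaxation suggested at the end amounts to dropping unreachableEndpoints from the count. A minimal sketch with illustrative numbers (the exact counts in the scenario depend on what each restarted seed has seen; the function names are hypothetical):

```python
N_SEEDS = 4  # configured seeds in the hypothetical 4-seed / 2-non-seed cluster

def strict_search(n_live, n_unreachable):
    # Patch as attached: keep searching for seeds while the total number
    # of seen nodes (live + unreachable) is below the seed count.
    return n_live + n_unreachable < N_SEEDS

def relaxed_search(n_live, n_unreachable):
    # Proposed relaxation: ignore unreachableEndpoints entirely.
    return n_live < N_SEEDS

# Deadlock state: one live peer (the other seed) plus enough unreachable
# entries to bring the seen total up to the seed count (counts illustrative).
print(strict_search(1, 3))   # False: search stops, partition persists
print(relaxed_search(1, 3))  # True: search continues, deadlock broken
```

The cost of the relaxed check is a little extra gossip to seeds while unreachable nodes stay down, which is the trade-off discussed in the following comments.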




[jira] Updated: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-150:
-------------------------------------

    Priority: Critical  (was: Major)



[jira] Updated: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-150:
-------------------------------------

    Priority: Minor  (was: Critical)



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782718#action_12782718 ] 

Jaakko Laine commented on CASSANDRA-150:
----------------------------------------

It might indeed be better to check only whether liveEndpoints.size < seeds.size (that is, not count unreachableEndpoints). This will cause a bit more unnecessary gossip to seeds in some special cases, but it is perhaps the better approach. I have to think about this a bit more.



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781952#action_12781952 ] 

Jonathan Ellis commented on CASSANDRA-150:
------------------------------------------

That makes total sense.  Nice work!



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782875#action_12782875 ] 

Hudson commented on CASSANDRA-150:
----------------------------------

Integrated in Cassandra #269 (See [http://hudson.zones.apache.org/hudson/job/Cassandra/269/])
    send extra gossip to random seed as long as there are less nodes alive than seed nodes configured
patch by Jaakko Laine; reviewed by jbellis for 




[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12706971#action_12706971 ] 

Jonathan Ellis commented on CASSANDRA-150:
------------------------------------------

daishi's "Unable to find a live Endpoint we might be out of live nodes" bug was almost certainly caused by the same thing. (He has a 3-node cluster.)

For both of us, switching to a single seed has fixed the issue for now. (But ultimately we do want to support multiple seeds, for redundancy.)



[jira] Updated: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaakko Laine updated CASSANDRA-150:
-----------------------------------

    Attachment: 150.patch

Added an extra condition to send gossip to a random seed also if liveEndpoints.size + unreachableEndpoints.size is less than seeds.size. This will occasionally cause us to send the same gossip to the same seed twice (when the seed is live but we have not yet seen enough nodes), but I think that is OK: it is quite a special case and goes away as soon as we have seen enough nodes.

Another option would be to add an extra parameter, excludeThisNode, to sendGossip and skip sending if the random choice returns that address, but IMHO that option is messy and gains very little.
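The patched decision can be sketched roughly like this (a sketch only, with hypothetical names; the real change lives in the Gossiper's gossip loop, not in a standalone function):

```python
def should_gossip_to_random_seed(picked_seed, live_endpoints,
                                 unreachable_endpoints, seeds):
    """Sketch of rule (iii) with the patch's extra condition: gossip to a
    random seed if the node picked by rule (i) was not a seed, OR if we
    have seen fewer nodes in total than there are configured seeds."""
    seen = len(live_endpoints) + len(unreachable_endpoints)
    return (not picked_seed) or seen < len(seeds)

# In the 4-seed scenario: nodeA's only live peer is seed nodeB, but it has
# seen only 1 node out of 4 configured seeds, so it keeps searching.
print(should_gossip_to_random_seed(True, {"nodeB"}, set(),
                                   {"nodeA", "nodeB", "nodeC", "nodeD"}))  # True
```

Once enough nodes have been seen, the extra clause stops firing and only the original "picked a non-seed" trigger remains, which is why the duplicate gossip to a live seed is transient.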




[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782675#action_12782675 ] 

Jonathan Ellis commented on CASSANDRA-150:
------------------------------------------

I don't think this quite works -- e.g., the guy on the mailing list with 3 seeds in a 4-node cluster. I think we have to make the check "have we seen all the seeds yet?" rather than "have we seen as many nodes as there are seeds?"



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782726#action_12782726 ] 

Jonathan Ellis commented on CASSANDRA-150:
------------------------------------------

I see, so to show that the weaker check is enough when seeds.size <= live node count, we reason that:

either all the live nodes are seeds, in which case non-seeds that come online will by definition introduce themselves to a member of the ring, and become known in turn;

or there is at least one non-seed node in the list, in which case eventually someone will gossip to it, and then gossip to a random seed via the existing clause in the if statement.

> It might indeed be better to check only if liveEndpoints.size < seeds.size

Yes, let's go with this. Better to do a little extra gossiping in corner cases than to risk indefinite partitions.



[jira] Commented: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jaakko Laine (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781937#action_12781937 ] 

Jaakko Laine commented on CASSANDRA-150:
----------------------------------------

A network partition may happen if (1) the cluster size is at least four nodes, (2) all nodes are seeds, and (3) at least two nodes boot "simultaneously".

The gossip cycle works as follows:
(i) gossip to a random live node
(ii) gossip to a random unreachable node
(iii) if the node gossiped to at (i) was not a seed, gossip to a random seed

Suppose there are four nodes in the cluster -- nodeA, nodeB, nodeC and nodeD, all of them seeds -- and they are all brought online at the same time. The following event sequence leads to a partition:

(1) nodeA comes online. There are no live nodes (and of course no unreachable ones either), so it gossips to a random seed. Suppose nodeA chooses nodeB; it sends nodeB gossip.
(2) nodeB receives nodeA's gossip and marks it live. When it sends its own gossip, it has a live node (nodeA), so it gossips according to rule (i). nodeA is a seed, so no gossip is sent to a random seed at (iii).
(3) nodeC comes online. It has not seen any other live nodes yet, so it gossips to a random seed. Suppose it chooses nodeD.
(4) nodeD comes online and sees nodeC's gossip. Since it now has a live node, it sends nodeC gossip according to rule (i). nodeC is a seed, so again no gossip is sent to a random seed.

(There are other sequences as well, but the basic idea is the same.)

Now every node knows of exactly one live node, so it will always send gossip according to rule (i). Since that live node is a seed, it will never send gossip to a random seed under rule (iii). This prevents the nodes from ever finding the rest of the cluster. A single non-seed node would break this loop, as gossip sent to it triggers gossip to a random seed.
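The deadlock can be illustrated with a toy model of the cycle (a sketch only; the names and structure are illustrative, not Cassandra's actual Gossiper code):

```python
import random

# Toy model of the gossip cycle described above.
SEEDS = {"nodeA", "nodeB", "nodeC", "nodeD"}  # every node is a seed

# Each node's liveEndpoints after steps (1)-(4): the pairs {A, B} and
# {C, D} have found each other and nothing else.
live = {"nodeA": ["nodeB"], "nodeB": ["nodeA"],
        "nodeC": ["nodeD"], "nodeD": ["nodeC"]}

def rule_iii_fires(node, rnd):
    """One round for `node`: rule (i) picks a random live peer; rule (iii)
    gossips to a random seed only if that peer was not a seed."""
    target = rnd.choice(live[node])  # rule (i): random live node
    return target not in SEEDS       # rule (iii) fires only for non-seeds

rnd = random.Random(0)
fired = any(rule_iii_fires(n, rnd) for _ in range(1000) for n in live)
print(fired)  # False: rule (iii) never fires, so the two pairs never merge
```

Replacing any one peer with a non-seed makes rule_iii_fires return True whenever that peer is picked, which is exactly why a single non-seed node breaks the closures.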

While investigating this, I noticed we might have harmed the scalability of the gossip mechanism when we added two new application states for node movement. I'll check tomorrow whether there is a problem and fix it if so.




[jira] Updated: (CASSANDRA-150) multiple seeds (only when seed count = node count?) can cause cluster partition

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-150:
-------------------------------------

    Fix Version/s:     (was: 0.4)
