Posted to commits@cassandra.apache.org by "Brandon Williams (JIRA)" <ji...@apache.org> on 2013/11/06 23:26:21 UTC
[jira] [Comment Edited] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes

    [ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815379#comment-13815379 ] 

Brandon Williams edited comment on CASSANDRA-6127 at 11/6/13 10:25 PM:
-----------------------------------------------------------------------

At this point, I think we should:

* see if the flapping happens with vnodes in 1.2 head (maybe Quentin already knows from his last test)
* see if the flapping happens without vnodes in 1.2 head, but with the same number of nodes

If sum() in ArrivalWindow is what burns the most CPU in the Gossiper task (note: not bottlenecking; each call took at most ~3ms, there were just lots of them), then the problem is no longer tied to vnodes (if it ever was, since sum is per-node, not per-token). In that case we should probably open a new ticket (can't start a cluster of size >=X all at once, or similar) and discuss there.  We know that clusters much larger than any discussed on this ticket exist, but I don't think any of them have all rebooted at once.
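
For illustration only: if many cheap sum() calls over each endpoint's arrival window are what add up in the Gossiper task, one generic way to make each call constant-time is to keep a running total as samples enter and leave the window. The sketch below is not Cassandra's ArrivalWindow code; the class name RunningSumWindow, its capacity parameter, and its methods are hypothetical, and whether this would actually help the flapping is untested.

```java
import java.util.ArrayDeque;

/**
 * Hypothetical bounded window of heartbeat inter-arrival intervals.
 * Not Cassandra's ArrivalWindow; a sketch of the running-sum idea only.
 */
final class RunningSumWindow {
    private final ArrayDeque<Long> intervals = new ArrayDeque<>();
    private final int capacity;
    private long sum = 0L; // kept up to date on every add, so sum() never iterates

    RunningSumWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Records one interval, evicting the oldest sample once the window is full. */
    void add(long intervalMillis) {
        if (intervals.size() == capacity) {
            sum -= intervals.removeFirst();
        }
        intervals.addLast(intervalMillis);
        sum += intervalMillis;
    }

    /** O(1): returns the maintained total instead of walking the window. */
    long sum() {
        return sum;
    }

    /** Mean interval over the current window, or 0.0 when the window is empty. */
    double mean() {
        return intervals.isEmpty() ? 0.0 : (double) sum / intervals.size();
    }
}
```

The cost stays per-node rather than per-token, as noted above: one window per endpoint, regardless of how many vnodes that endpoint owns.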


was (Author: brandon.williams):
At this point, I think we should:

* see if the flapping happens with vnodes (maybe Quentin already knows from his last test)
* see if the flapping happens without vnodes but the same number of nodes

Because if sum() in ArrivalWindow is burning the most CPU in the Gossiper task (note: not bottlenecking, each call was at most ~3ms, there were just lots of them) then that means that the problem is no longer tied to vnodes (if it ever was, since sum is per-node, not per-token) and we should probably open a new ticket (can't start a cluster of size >=X all at once, or similar) and discuss there.  We know that clusters much larger than any discussed on this ticket exist, but I don't think any of them have all rebooted at once.

> vnodes don't scale to hundreds of nodes
> ---------------------------------------
>
>                 Key: CASSANDRA-6127
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Any cluster that has vnodes and consists of hundreds of physical nodes.
>            Reporter: Tupshin Harper
>            Assignee: Jonathan Ellis
>         Attachments: 6000vnodes.patch, AdjustableGossipPeriod.patch, delayEstimatorUntilStatisticallyValid.patch
>
>
> There are a lot of gossip-related issues on very wide clusters that also have vnodes enabled. Let's use this ticket as a master in case there are sub-tickets.
> The most obvious symptom I've seen is with 1000 nodes in EC2 on m1.xlarge instances, with each node configured with 32 vnodes.
> Without vnodes, the cluster spins up fine and is ready to handle requests within 30 minutes.
> With vnodes, nodes report constant up/down flapping messages with no external load on the cluster. After a couple of hours, they were still flapping and had very high CPU load, and the cluster never looked like it was going to stabilize or be useful for traffic.


