Posted to user@cassandra.apache.org by Daniel Kluesing <dk...@bluekai.com> on 2010/03/25 16:37:13 UTC

Ring management and load balance

I wanted to check my understanding of the load balance operation. Let's say I have 5 nodes, each assigned 1/5 of the ring at startup, and the load is equal across them (say, using the random partitioner). The load on the cluster gets high, so I add a sixth server. During bootstrap, the new server will pick the existing server with the highest load and take half the load from that server. After bootstrap, I would end up with 4 servers with 1/5 of the ring each and 2 servers with 1/10 of the ring each - is this correct? I'll get hotspots unless I double the number of nodes?
				
Really I would want adding a sixth server to result in six machines with 1/6 of the load, taken evenly from the existing nodes.  If I understand correctly - and correct me if I'm wrong - the core of this is that each server is assigned one token, while in a system like Dynamo, a server is assigned multiple tokens around the ring. Is there any benefit to assigning only one token? Has anyone worked on assigning a server multiple tokens, or is there some other, unrelated way to get more even load distribution when adding a node?
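
(For concreteness, a minimal sketch of the ownership arithmetic described above. It assumes tokens on a 0..2^127 ring, as with RandomPartitioner, and that each node owns the span from its predecessor's token up to its own; the class and method names are just for illustration.)

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RingOwnershipSketch {
        // Token space taken as 0..2^127; the exact bound doesn't change the arithmetic.
        static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

        // A node owns the span from its predecessor's token (exclusive) to its own token.
        static void printOwnership(List<BigInteger> tokens) {
            List<BigInteger> sorted = new ArrayList<>(tokens);
            Collections.sort(sorted);
            for (int i = 0; i < sorted.size(); i++) {
                BigInteger prev = sorted.get((i - 1 + sorted.size()) % sorted.size());
                BigInteger span = sorted.get(i).subtract(prev).mod(RING);
                System.out.printf("token %s owns %.1f%%%n", sorted.get(i),
                        100.0 * span.doubleValue() / RING.doubleValue());
            }
        }

        public static void main(String[] args) {
            List<BigInteger> tokens = new ArrayList<>();
            for (int i = 0; i < 5; i++) {   // five evenly spaced nodes: 20% each
                tokens.add(RING.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(5)));
            }
            printOwnership(tokens);

            // A sixth node bootstraps into the middle of one existing range:
            // that range splits into two 10% pieces, the other four stay at 20%.
            tokens.add(RING.divide(BigInteger.valueOf(10)));
            printOwnership(tokens);
        }
    }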

Re: Ring management and load balance

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Mar 25, 2010 at 11:40 AM, Jeremy Dunck <jd...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 10:56 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>> The advantage to doing it the way Cassandra does is that you can keep
>> keys sorted with OrderPreservingPartitioner for range scans.  Grabbing
>> one token of many from each node in the ring would prohibit that.
>>
>> So we rely on active load balancing to get to a "good enough" balance,
>> say within 50%.  It doesn't need to be perfect.
>
> Isn't this also only a real issue with small clusters?

Right, good point.

Re: Ring management and load balance

Posted by Jeremy Dunck <jd...@gmail.com>.
On Thu, Mar 25, 2010 at 10:56 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> The advantage to doing it the way Cassandra does is that you can keep
> keys sorted with OrderPreservingPartitioner for range scans.  Grabbing
> one token of many from each node in the ring would prohibit that.
>
> So we rely on active load balancing to get to a "good enough" balance,
> say within 50%.  It doesn't need to be perfect.

Isn't this also only a real issue with small clusters?  As you
add nodes, they all get smoothed out (by halving the hottest node),
and if you have a 50-node cluster, 1/50 to 1/100 is a relatively small
difference.

You certainly wouldn't want to shuffle all the data when adding a node
for perfect balance...

Re: Ring management and load balance

Posted by Roland Hänel <ro...@haenel.me>.
But this

On 26.03.2010 22:29, "Rob Coli" <rc...@digg.com> wrote:

On 3/26/10 1:36 PM, Roland Hänel wrote:
>
> If I was going to write such a tool: do you think the th...
The JMX interface exposes an Attribute which seems appropriate to this use.
It is called "TotalDiskSpaceUsed," and is available on a per-columnfamily
basis. Given a CF called "Users" in a Keyspace called "MyKeyspace", it is
accessible at :

"
org.apache.cassandra.db:type=ColumnFamilyStores,keyspace=MyKeyspace,columnfamily=Users
"

Unfortunately, I have yet to document the per-CF "ColumnFamilyStores" and
"Caches" JMX interface Attributes and Operations, including this one. But
when I do so, I will do it here :

http://wiki.apache.org/cassandra/JmxInterface

=Rob

Re: Ring management and load balance

Posted by Roland Hänel <ro...@haenel.me>.
Sorry for the last mail, I hit the wrong button.  This JMX property gives
per-CF granularity, right?

I think it doesn't completely solve the problem here, because key
load-balancing effectively demands per-key granularity. But it could help
with statistical sampling.

Roland

On 26.03.2010 22:29, "Rob Coli" <rc...@digg.com> wrote:

On 3/26/10 1:36 PM, Roland Hänel wrote:
>
> If I was going to write such a tool: do you think the th...
The JMX interface exposes an Attribute which seems appropriate to this use.
It is called "TotalDiskSpaceUsed," and is available on a per-columnfamily
basis. Given a CF called "Users" in a Keyspace called "MyKeyspace", it is
accessible at :

"
org.apache.cassandra.db:type=ColumnFamilyStores,keyspace=MyKeyspace,columnfamily=Users
"

Unfortunately, I have yet to document the per-CF "ColumnFamilyStores" and
"Caches" JMX interface Attributes and Operations, including this one. But
when I do so, I will do it here :

http://wiki.apache.org/cassandra/JmxInterface

=Rob

Re: Ring management and load balance

Posted by Rob Coli <rc...@digg.com>.
On 3/26/10 1:36 PM, Roland Hänel wrote:
> If I was going to write such a tool: do you think the thrift API
> provides the necessary information? I think with the RandomPartitioner
> you cannot scan all your rows to actually find out how big certain
> ranges of rows are. And even with the OPP (that is the major target for
> this kind of tool, for sure) you would have to fetch all row's content
> just to find out how large it is, right?

The JMX interface exposes an Attribute which seems appropriate to this 
use. It is called "TotalDiskSpaceUsed," and is available on a 
per-columnfamily basis. Given a CF called "Users" in a Keyspace called 
"MyKeyspace", it is accessible at :

"
org.apache.cassandra.db:type=ColumnFamilyStores,keyspace=MyKeyspace,columnfamily=Users
"

Unfortunately, I have yet to document the per-CF "ColumnFamilyStores" 
and "Caches" JMX interface Attributes and Operations, including this 
one. But when I do so, I will do it here :

http://wiki.apache.org/cassandra/JmxInterface

=Rob
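
A minimal sketch of reading that attribute remotely from Java (the host, port, keyspace and column family below are placeholders; adjust them for your own node):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class DiskSpaceProbe {
        public static void main(String[] args) throws Exception {
            // Connect to the node's JMX port (assumed here to be 8080 on localhost).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Per-CF MBean name and attribute, as given above.
                ObjectName cfStore = new ObjectName(
                        "org.apache.cassandra.db:type=ColumnFamilyStores,"
                        + "keyspace=MyKeyspace,columnfamily=Users");
                Object bytes = mbs.getAttribute(cfStore, "TotalDiskSpaceUsed");
                System.out.println("TotalDiskSpaceUsed = " + bytes + " bytes");
            } finally {
                connector.close();
            }
        }
    }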

Re: Ring management and load balance

Posted by Roland Hänel <ro...@haenel.me>.
Mike,

If you assume that your rows are roughly equal in size (at least
statistically), then you could also just take a node's total load (this
is exposed via JMX) and divide by the number of keys/rows on that node. Not
sure how to get the latter, but it shouldn't be a big deal to expose via
JMX if it isn't already there.

Roland

On 26.03.2010 22:36, "Mike Malone" <mi...@simplegeo.com> wrote:

2010/3/26 Roland Hänel <ro...@haenel.me>

>
> Jonathan,
>
> I agree with your idea about a tool that could 'propose' good token
choices for op...
With the random partitioner there's no need to suggest a token. The key
space is statistically random so you should be able to just split 2^128 into
equal sized segments and get fairly equal storage load. Your read / write
load could get out of whack if you have hot spots and stuff, I guess. But
for a large distributed data set I think that's unlikely.

For order preserving partitioners it's harder. We've been thinking about
this issue at SimpleGeo and were planning on implementing an algorithm that
could determine the median row key statistically without having to inspect
every key. Basically, it would pull a random sample of row keys (maybe from
the Index file?) and then determine the median of that sample. Thoughts?

Mike

Re: Ring management and load balance

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Mar 26, 2010 at 4:35 PM, Mike Malone <mi...@simplegeo.com> wrote:
> With the random partitioner there's no need to suggest a token. The key
> space is statistically random so you should be able to just split 2^128 into
> equal sized segments and get fairly equal storage load. Your read / write
> load could get out of whack if you have hot spots and stuff, I guess. But
> for a large distributed data set I think that's unlikely.
> For order preserving partitioners it's harder. We've been thinking about
> this issue at SimpleGeo and were planning on implementing an algorithm that
> could determine the median row key statistically without having to inspect
> every key. Basically, it would pull a random sample of row keys (maybe from
> the Index file?) and then determine the median of that sample. Thoughts?

That's exactly what the bootstrap token calculation does for OPP,
after picking the most-loaded node to talk to.  You could expose that
over JMX, or generalize it to give, say, 100 evenly spaced tokens, so
the tool could estimate position to within 1%.

-Jonathan
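
Not the actual bootstrap code, but a rough sketch of the sampling idea, assuming the sampled keys are roughly representative of the stored data (the class and method names are made up for illustration):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    public class SampleQuantiles {
        // Given a random sample of row keys (e.g. pulled from the sstable index
        // files), return n evenly spaced quantiles of the sample.  The middle
        // value approximates the median key, i.e. a reasonable OPP split token.
        static List<String> quantiles(List<String> sampledKeys, int n) {
            List<String> sorted = new ArrayList<>(sampledKeys);
            Collections.sort(sorted);           // OPP order is plain key order
            List<String> result = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                int idx = (int) ((long) i * (sorted.size() - 1) / (n + 1));
                result.add(sorted.get(idx));
            }
            return result;
        }

        public static void main(String[] args) {
            List<String> sample = Arrays.asList("ant", "bee", "cat", "dog", "eel", "fox", "gnu");
            System.out.println(quantiles(sample, 3));   // roughly the 25/50/75% keys
        }
    }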

Re: Ring management and load balance

Posted by Mike Malone <mi...@simplegeo.com>.
2010/3/26 Roland Hänel <ro...@haenel.me>

> Jonathan,
>
> I agree with your idea about a tool that could 'propose' good token choices
> for optimal load-balancing.
>
> If I was going to write such a tool: do you think the thrift API provides
> the necessary information? I think with the RandomPartitioner you cannot
> scan all your rows to actually find out how big certain ranges of rows are.
> And even with the OPP (that is the major target for this kind of tool, for
> sure) you would have to fetch all row's content just to find out how large
> it is, right?
>

With the random partitioner there's no need to suggest a token. The key
space is statistically random so you should be able to just split 2^128 into
equal sized segments and get fairly equal storage load. Your read / write
load could get out of whack if you have hot spots and stuff, I guess. But
for a large distributed data set I think that's unlikely.

For order preserving partitioners it's harder. We've been thinking about
this issue at SimpleGeo and were planning on implementing an algorithm that
could determine the median row key statistically without having to inspect
every key. Basically, it would pull a random sample of row keys (maybe from
the Index file?) and then determine the median of that sample. Thoughts?

Mike
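
A minimal sketch of the first suggestion (the 2^128 bound follows the description above; a given Cassandra release's RandomPartitioner range may differ):

    import java.math.BigInteger;

    public class EvenTokens {
        // Divide the key space into equal segments and hand one token to each node.
        public static void main(String[] args) {
            int nodes = 6;                                    // hypothetical cluster size
            BigInteger ring = BigInteger.ONE.shiftLeft(128);  // 2^128, as described above
            for (int i = 0; i < nodes; i++) {
                BigInteger token = ring.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodes));
                System.out.println("node " + i + " token: " + token);
            }
        }
    }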

Re: Ring management and load balance

Posted by Roland Hänel <ro...@haenel.me>.
Jonathan,

I agree with your idea about a tool that could 'propose' good token choices
for optimal load-balancing.

If I was going to write such a tool: do you think the thrift API provides
the necessary information? I think with the RandomPartitioner you cannot
scan all your rows to actually find out how big certain ranges of rows are.
And even with the OPP (that is the major target for this kind of tool, for
sure) you would have to fetch all row's content just to find out how large
it is, right?

Greetings,
Roland

On 25.03.2010 22:28, "Jonathan Ellis" <jb...@gmail.com> wrote:

One problem is if the heaviest node is next to a node that is
lighter than average, instead of heavier.  Then if the new node takes
extra from the heaviest, say 75% instead of just 1/2, and we then take
1/2 of the heaviest's neighbor and put it on the heaviest, we've made
that lighter-than-average node even lighter.

Could you move 1/2, 1/4, etc. only until you get to a node lighter
than average?  Probably.  But I'm not sure it's a big enough win to
justify the complexity.

Probably a better solution would be a tool where you tell it "I want
to add N nodes to my cluster"; it analyzes the load factors and tells
you what tokens to add them with, and what additional moves to make to
get you within M% of equal loads, with the minimum amount of data
movement.

-Jonathan


On Thu, Mar 25, 2010 at 1:52 PM, Jeremy Dunck <jd...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 1...

RE: Ring management and load balance

Posted by Daniel Kluesing <dk...@bluekai.com>.
I agree it's only a problem with 'small' clusters - but it seems like 'small' is 'most users'? Even with 10 nodes it looks like a pretty big imbalance if I add an 11th node, and don't add the other 9 or move a large part of the ring. Or in practice have folks not had trouble with incremental scalability?



-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, March 25, 2010 2:27 PM
To: user@cassandra.apache.org
Subject: Re: Ring management and load balance

One problem is if the heaviest node is next to a node that is
lighter than average, instead of heavier.  Then if the new node takes
extra from the heaviest, say 75% instead of just 1/2, and we then take
1/2 of the heaviest's neighbor and put it on the heaviest, we've made
that lighter-than-average node even lighter.

Could you move 1/2, 1/4, etc. only until you get to a node lighter
than average?  Probably.  But I'm not sure it's a big enough win to
justify the complexity.

Probably a better solution would be a tool where you tell it "I want
to add N nodes to my cluster"; it analyzes the load factors and tells
you what tokens to add them with, and what additional moves to make to
get you within M% of equal loads, with the minimum amount of data
movement.

-Jonathan

On Thu, Mar 25, 2010 at 1:52 PM, Jeremy Dunck <jd...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 1:26 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> Pretty much everything assumes that there is a 1:1 correspondence
>> between IP and Token.  It's probably in the ballpark of "one month to
>> code, two to get the bugs out."  Gossip is one of the trickier parts
>> of our code base, and this would be all over that.  The actual storage
>> system changes would be simpler I think.
>
> What if adding a node shifted down-ring tokens less and less?  If
> adding node N+1, it shifts the first N/2^x, the second N/2^2x, the
> third N/2^3x, etc, so that a fixed number of nodes are shifted, but
> the bump is smoothed out?  Tokens stay 1:1.
>
> I'm talking out of my league here -- haven't actually run a cluster
> yet -- so probably a dumb idea.  :-)
>

RE: Ring management and load balance

Posted by Stu Hood <st...@rackspace.com>.
In practice it is much more likely that you grow your cluster by a certain large percentage each time. With a 10-node cluster, you are likely to add 5 nodes at a time, and with a 100-node cluster you'll probably add 25 to 50 per batch.


-----Original Message-----
From: "Daniel Kluesing" <dk...@bluekai.com>
Sent: Thursday, March 25, 2010 6:59pm
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: RE: Ring management and load balance

I agree it's only a problem with 'small' clusters - but it seems like 'small' is 'most users'? Even with 10 nodes it looks like a pretty big imbalance if I add an 11th node, and don't add the other 9 or move a large part of the ring. Or in practice have folks not had trouble with incremental scalability?



-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, March 25, 2010 2:27 PM
To: user@cassandra.apache.org
Subject: Re: Ring management and load balance

One problem is if the heaviest node is next to a node that is
lighter than average, instead of heavier.  Then if the new node takes
extra from the heaviest, say 75% instead of just 1/2, and we then take
1/2 of the heaviest's neighbor and put it on the heaviest, we've made
that lighter-than-average node even lighter.

Could you move 1/2, 1/4, etc. only until you get to a node lighter
than average?  Probably.  But I'm not sure it's a big enough win to
justify the complexity.

Probably a better solution would be a tool where you tell it "I want
to add N nodes to my cluster"; it analyzes the load factors and tells
you what tokens to add them with, and what additional moves to make to
get you within M% of equal loads, with the minimum amount of data
movement.

-Jonathan

On Thu, Mar 25, 2010 at 1:52 PM, Jeremy Dunck <jd...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 1:26 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> Pretty much everything assumes that there is a 1:1 correspondence
>> between IP and Token.  It's probably in the ballpark of "one month to
>> code, two to get the bugs out."  Gossip is one of the trickier parts
>> of our code base, and this would be all over that.  The actual storage
>> system changes would be simpler I think.
>
> What if adding a node shifted down-ring tokens less and less?  If
> adding node N+1, it shifts the first N/2^x, the second N/2^2x, the
> third N/2^3x, etc, so that a fixed number of nodes are shifted, but
> the bump is smoothed out?  Tokens stay 1:1.
>
> I'm talking out of my league here -- haven't actually run a cluster
> yet -- so probably a dumb idea.  :-)
>



Re: Ring management and load balance

Posted by Jonathan Ellis <jb...@gmail.com>.
One problem is if the heaviest node is next to a node that is
lighter than average, instead of heavier.  Then if the new node takes
extra from the heaviest, say 75% instead of just 1/2, and we then take
1/2 of the heaviest's neighbor and put it on the heaviest, we've made
that lighter-than-average node even lighter.

Could you move 1/2, 1/4, etc. only until you get to a node lighter
than average?  Probably.  But I'm not sure it's a big enough win to
justify the complexity.

Probably a better solution would be a tool where you tell it "I want
to add N nodes to my cluster"; it analyzes the load factors and tells
you what tokens to add them with, and what additional moves to make to
get you within M% of equal loads, with the minimum amount of data
movement.

-Jonathan

On Thu, Mar 25, 2010 at 1:52 PM, Jeremy Dunck <jd...@gmail.com> wrote:
> On Thu, Mar 25, 2010 at 1:26 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> Pretty much everything assumes that there is a 1:1 correspondence
>> between IP and Token.  It's probably in the ballpark of "one month to
>> code, two to get the bugs out."  Gossip is one of the trickier parts
>> of our code base, and this would be all over that.  The actual storage
>> system changes would be simpler I think.
>
> What if adding a node shifted down-ring tokens less and less?  If
> adding node N+1, it shifts the first N/2^x, the second N/2^2x, the
> third N/2^3x, etc, so that a fixed number of nodes are shifted, but
> the bump is smoothed out?  Tokens stay 1:1.
>
> I'm talking out of my league here -- haven't actually run a cluster
> yet -- so probably a dumb idea.  :-)
>

Re: Ring management and load balance

Posted by Jeremy Dunck <jd...@gmail.com>.
On Thu, Mar 25, 2010 at 1:26 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> Pretty much everything assumes that there is a 1:1 correspondence
> between IP and Token.  It's probably in the ballpark of "one month to
> code, two to get the bugs out."  Gossip is one of the trickier parts
> of our code base, and this would be all over that.  The actual storage
> system changes would be simpler I think.

What if adding a node shifted down-ring tokens less and less?  If
adding node N+1, it shifts the first N/2^x, the second N/2^2x, the
third N/2^3x, etc, so that a fixed number of nodes are shifted, but
the bump is smoothed out?  Tokens stay 1:1.

I'm talking out of my league here -- haven't actually run a cluster
yet -- so probably a dumb idea.  :-)
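
(One possible reading of the idea above, sketched purely for illustration -- this is not how Cassandra assigns tokens: the new node bisects one range, and each following token is nudged forward by a geometrically shrinking fraction of a node's span, so the deficit is spread across a few neighbors instead of landing on one.)

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.List;

    public class GeometricShiftSketch {
        public static void main(String[] args) {
            BigInteger ring = BigInteger.ONE.shiftLeft(127);   // token space
            int nodes = 5;
            List<BigInteger> tokens = new ArrayList<>();
            for (int i = 0; i < nodes; i++) {
                tokens.add(ring.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodes)));
            }
            BigInteger span = ring.divide(BigInteger.valueOf(nodes));  // one node's range

            // The new node bisects the first range ...
            BigInteger newToken = tokens.get(0).add(span.divide(BigInteger.valueOf(2)));
            // ... and each down-ring token shifts forward by span/4, span/8, span/16, ...
            // Ownership ends up roughly 10/15/17.5/18.75/19.4/19.4% instead of 10/10/20/20/20%.
            long divisor = 4;
            for (int i = 1; i < nodes; i++) {
                tokens.set(i, tokens.get(i).add(span.divide(BigInteger.valueOf(divisor))));
                divisor *= 2;
            }
            tokens.add(newToken);
            System.out.println(tokens);
        }
    }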

Re: Ring management and load balance

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Mar 25, 2010 at 1:17 PM, Mike Malone <mi...@simplegeo.com> wrote:
> On Thu, Mar 25, 2010 at 9:56 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> The advantage to doing it the way Cassandra does is that you can keep
>> keys sorted with OrderPreservingPartitioner for range scans.  Grabbing
>> one token of many from each node in the ring would prohibit that.
>>
>> So we rely on active load balancing to get to a "good enough" balance,
>> say within 50%.  It doesn't need to be perfect.
>
> This makes sense for the order preserving partitioner. But for the random
> partitioner multiple tokens per node would certainly make balancing
> easier... I haven't dug into that bit of the Cassandra implementation yet.
> Would it be very difficult to support both modes of operation?

I guess that depends on what your threshold of "very" difficult is,
doesn't it. :)

Pretty much everything assumes that there is a 1:1 correspondence
between IP and Token.  It's probably in the ballpark of "one month to
code, two to get the bugs out."  Gossip is one of the trickier parts
of our code base, and this would be all over that.  The actual storage
system changes would be simpler I think.

-Jonathan

Re: Ring management and load balance

Posted by Mike Malone <mi...@simplegeo.com>.
On Thu, Mar 25, 2010 at 9:56 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> The advantage to doing it the way Cassandra does is that you can keep
> keys sorted with OrderPreservingPartitioner for range scans.  Grabbing
> one token of many from each node in the ring would prohibit that.
>
> So we rely on active load balancing to get to a "good enough" balance,
> say within 50%.  It doesn't need to be perfect.
>

This makes sense for the order preserving partitioner. But for the random
partitioner multiple tokens per node would certainly make balancing
easier... I haven't dug into that bit of the Cassandra implementation yet.
Would it be very difficult to support both modes of operation?

For what it's worth, we've already seen annoying behavior when adding nodes
to the cluster. It's obviously true that the absolute size of partitions
becomes smaller as the cluster grows, but if your relatively balanced 100
node cluster is at, say, 70% capacity and you add 10 more nodes you would
presumably want this additional capacity to be evenly distributed. And right
now that's pretty much impossible to do without rebalancing the entire
cluster.

Mike

Re: Ring management and load balance

Posted by Jonathan Ellis <jb...@gmail.com>.
The advantage to doing it the way Cassandra does is that you can keep
keys sorted with OrderPreservingPartitioner for range scans.  Grabbing
one token of many from each node in the ring would prohibit that.

So we rely on active load balancing to get to a "good enough" balance,
say within 50%.  It doesn't need to be perfect.

On Thu, Mar 25, 2010 at 10:37 AM, Daniel Kluesing <dk...@bluekai.com> wrote:
> I wanted to check my understanding of the load balance operation. Let's say I have 5 nodes, each assigned 1/5 of the ring at startup, and the load is equal across them (say, using the random partitioner). The load on the cluster gets high, so I add a sixth server. During bootstrap, the new server will pick the existing server with the highest load and take half the load from that server. After bootstrap, I would end up with 4 servers with 1/5 of the ring each and 2 servers with 1/10 of the ring each - is this correct? I'll get hotspots unless I double the number of nodes?
>
> Really I would want adding a sixth server to result in six machines with 1/6 of the load, taken evenly from the existing nodes.  If I understand correctly - and correct me if I'm wrong - the core of this is that each server is assigned one token, while in a system like Dynamo, a server is assigned multiple tokens around the ring. Is there any benefit to assigning only one token? Has anyone worked on assigning a server multiple tokens, or is there some other, unrelated way to get more even load distribution when adding a node?
>