Posted to dev@cassandra.apache.org by Jaakko <ro...@gmail.com> on 2010/02/09 07:51:07 UTC

loadbalance and different strategies

Hi,

The current implementation of loadbalance seems to work only for
RackUnaware. The problem is that only RackUnaware creates a "flat"
replica space where all nodes are equal. This is not true for the
other strategies: due to rack and DC considerations, replicas are not
evenly distributed among nodes. To illustrate the problem, consider
the following scenario:

- a cluster with nodes A through H. Nodes B and G are in DC1, the
rest of the nodes are in DC2.
- DC shard strategy, replication factor 3 (without loss of generality
we can omit rack considerations from this discussion).

In this situation the ranges would be (node: primary range, followed
by replica ranges):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G
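
To make this concrete, below is a minimal Python sketch of a
simplified DC-aware placement rule (one replica in a different DC,
the rest clockwise). It is an illustrative toy of the idea, not the
actual strategy code, and it does not reproduce the table above
exactly, but it shows the same effect: G soaks up replicas for most
of the ranges.

from collections import defaultdict

# Ring order and DC membership from the example above.
ring = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
dc = {'B': 'DC1', 'G': 'DC1'}
for n in ring:
    dc.setdefault(n, 'DC2')

def replicas_for(primary, rf=3):
    # Toy rule: the primary node, then the first node clockwise in a
    # different DC, then fill the remaining slots clockwise.
    start = ring.index(primary)
    clockwise = [ring[(start + i) % len(ring)] for i in range(1, len(ring))]
    chosen = [primary, next(n for n in clockwise if dc[n] != dc[primary])]
    for n in clockwise:
        if len(chosen) == rf:
            break
        if n not in chosen:
            chosen.append(n)
    return chosen

stored = defaultdict(list)
for i, node in enumerate(ring):
    rng = ring[i - 1] + '-' + node   # the range that ends at this node's token
    for holder in replicas_for(node):
        stored[holder].append(rng)

for node in ring:
    print(node, stored[node])

With this toy rule G ends up storing five of the eight ranges while
every other node stores two or three, which is exactly the imbalance
described below.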

Now, in this situation node G is most likely by far the most loaded
one, so if a node bootstraps (either a new one, or a loadbalance
operation), it will take half of G's range. The problem is that it
will take half of G's _primary_ range, but most of G's load comes
from _replicas_. After this operation, the ring would be (X denotes
the new node):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
X: F-X, A-B, B-C, C-D, D-E, E-F
G: X-G, F-X
H: G-H, E-F, F-X, X-G

It is clear from this that the situation has not really improved.
The only difference is that X is now the most loaded node and G
probably has a very light load. If another new node arrives, it will
again go in front of X, and the situation will remain largely the
same.

In order to get rid of such replica sinks, nodes would also need to
consider replica "categories". When we're looking for replica
destinations, we essentially consider the categories "in another DC",
"in another rack" and "anywhere". When node X boots in the ring
above, it should not just consider G's primary range, but G's
effective range (primary plus replicas). The number of replicas a
node holds is determined largely by the nodes that belong to the same
replica category. If X belongs to DC1 (same as B and G), the best
balance would be gained if X booted in the middle of B and G, as that
would divide the replicas evenly. This might not always be the best
place, because individual ranges might differ considerably in size.

To fix this properly, the ideal solution would be to modify the load
string so that, instead of the total load, it reports both the load
from the primary range and the load from replicas. This would allow
the bootstrapping node to decide whether it should take half of the
replicas or half of the primary range in order to get the optimal
result. However, there is currently no way to get these two numbers,
so we only have the total load number. It would not be perfect, but
perhaps for now it would be best to only consider nodes from the same
category when making load balancing decisions. That is, for rack
unaware we would consider all nodes as we do now, but for other
strategies we would determine the bootstrap token based on which
nodes are in the same category. I don't know if this would work, but
it should at least be better than what we have now.
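
As a rough sketch of that interim approach (illustrative only, not a
patch; the node list and token space below are made up for the
example): restrict the candidates to nodes in the bootstrapping
node's own category, pick the most loaded of them, and choose a token
halfway between it and the previous node of the same category, so the
category's replica responsibility gets split rather than just the
primary range.

RING_SIZE = 2 ** 127   # illustrative token space

def bootstrap_token(nodes, my_dc):
    # nodes: list of (name, dc, token, total_load) for live nodes.
    # Only look at nodes in my_dc, pick the most loaded one, and take a
    # token halfway between it and the previous node of the same DC.
    # Assumes my_dc already contains at least one node.
    same_dc = sorted((n for n in nodes if n[1] == my_dc), key=lambda n: n[2])
    target = max(same_dc, key=lambda n: n[3])
    prev_token = same_dc[same_dc.index(target) - 1][2]   # wraps around the ring
    span = (target[2] - prev_token) % RING_SIZE
    return (prev_token + span // 2) % RING_SIZE

# With B and G in DC1 (tokens 2/8 and 7/8 of the ring), a new DC1 node
# lands midway between them, dividing the replicas G currently holds.
eighth = RING_SIZE // 8
nodes = [('B', 'DC1', 2 * eighth, 10), ('G', 'DC1', 7 * eighth, 60),
         ('A', 'DC2', 1 * eighth, 20), ('C', 'DC2', 3 * eighth, 20)]
print(bootstrap_token(nodes, 'DC1') == (2 * eighth + 7 * eighth) // 2)   # True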

Another related issue: now that we have a strategy per table, how
should we approach load balancing? The optimal decision for one
strategy might be bad for another. If we have just one strategy in
use, that's clear, but with multiple strategies we would need to
determine which one to favor.

Or am I thinking about this in a completely wrong way?

-Jaakko

Google SoC 2010

Posted by Krishna Sankar <ks...@gmail.com>.
All,
	I have created a wiki page for Google SoC 2010 - http://wiki.apache.org/cassandra/GoogleSoc2010. It is a starting point. If we have enough ideas and mentors, it would be good to participate and get some work done.

	The window for mentoring organizations is March 8-12. [http://socghop.appspot.com/document/show/gsoc_program/google/gsoc2010/faqs#org_apply]. Let us see if we can get enough critical mass to make a case ...

Cheers
<k/>

Re: loadbalance and different strategies

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Feb 9, 2010 at 10:22 PM, Jaakko <ro...@gmail.com> wrote:
> Yes, that is of course true. However, I don't think this modification
> would make the algorithm much less simple. We still consider the most
> loaded node only, but take into account which DC the node is in.
> Without that extra step, loadbalance only works for rack unaware. If
> we make this change, nothing would change for rack unaware, but for
> other strategies things would be better, I think.

Let's give it a try!

-Jonathan

Re: loadbalance and different strategies

Posted by Jaakko <ro...@gmail.com>.
On Wed, Feb 10, 2010 at 12:45 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Feb 9, 2010 at 6:12 PM, Jaakko <ro...@gmail.com> wrote:
>> Let us suppose that all ranges are equal in size. In this case G's
>> range is A-G. If X boots in G's DC, it should take a token in the
>> middle of this range, which would be somewhere around D. If X boots
>> behind D
>
> Ah, I see, you are saying, "G has replicas from A-G, so really it
> should take a part of E's range instead of G's."

More like G has replicas from A-G, so X should take half of the replicas :)


> That seems reasonable, although it feels a little weird for X to ask G
> for a token and be given one that G isn't the primary for.

Yeah, it is a bit counter-intuitive, but if we consider where G's
load comes from (replicas), it is natural to try to divide that range
in half instead of just considering what G's primary range is.

> You're always going to have situations where a simple algorithm does
> the "wrong" thing though, which is why we leave the raw move command
> exposed.

Yes, that is of course true. However, I don't think this modification
would make the algorithm much less simple. We still consider the most
loaded node only, but take into account which DC the node is in.
Without that extra step, loadbalance only works for rack unaware. If
we make this change, nothing would change for rack unaware, but for
other strategies things would be better, I think.

-Jaakko

Re: loadbalance and different strategies

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Feb 9, 2010 at 6:12 PM, Jaakko <ro...@gmail.com> wrote:
> Let us suppose that all ranges are equal in size. In this case G's
> range is A-G. If X boots in G's DC, it should take a token in the
> middle of this range, which would be somewhere around D. If X boots
> behind D

Ah, I see, you are saying, "G has replicas from A-G, so really it
should take a part of E's range instead of G's."

That seems reasonable, although it feels a little weird for X to ask G
for a token and be given one that G isn't the primary for.

> Yes, alternating the nodes is certainly the best. However, two DCs
> don't always have the same number of nodes. Also, currently
> loadbalance is unusable in such an environment.

You're always going to have situations where a simple algorithm does
the "wrong" thing though, which is why we leave the raw move command
exposed.

-Jonathan

Re: loadbalance and different strategies

Posted by Jaakko <ro...@gmail.com>.
> (2) is where we get into trouble here no matter which DC we add to.
>  (a) if we add to G's DC, X will get all the replicas G has, remaining
> unbalanced
>  (b) if we add to the other DC, G will still be hit from all the
> replicas from the other DC

2b: yes
2a: not necessarily. Let's return once more to the original ring:

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G

Let us suppose that all ranges are equal in size. In this case G's
range is A-G. If X boots in G's DC, it should take a token in the
middle of this range, which would be somewhere around D. If X boots
behind D, the ring would be:

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
X: D-X, A-B, B-C, C-D
E: X-E, C-D, D-X
F: E-F, X-E
G: F-G, D-X, E-F
H: G-H, E-F, F-G

In this case X takes approximately half of G's load, since it takes
approximately half of the replicas G was responsible for.
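
To sanity-check the "somewhere around D" figure: assuming eight
equally spaced tokens (purely for illustration), the midpoint of G's
effective range (A, G] lands exactly on D's token:

RING_SIZE = 2 ** 127
tokens = {name: (i + 1) * RING_SIZE // 8 for i, name in enumerate('ABCDEFGH')}

span = (tokens['G'] - tokens['A']) % RING_SIZE
print((tokens['A'] + span // 2) % RING_SIZE == tokens['D'])   # True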

IMHO a new node should take half of the load of the most loaded node
*in the same DC*. If it only considers the most loaded node overall,
things go wrong, as booting in the middle of that node's range from
another DC will not balance the load in any way.

> So ISTM that the only real solution is to do what we say in the
> Operations page, and make sure that nodes on the ring alternate DCs.

Yes, alternating the nodes is certainly the best. However, two DCs
don't always have the same number of nodes. Also, currently
loadbalance is unusable in such an environment.

-Jaakko

Re: loadbalance and different strategies

Posted by Vijay <vi...@gmail.com>.
I think that when we add nodes under the DC shard strategy, we want
all the data to be in all the DCs (more than 2), and we want to
withstand some sort of failures (one or more node failures in any DC
without having to cross the DC for data). Hence we should evenly
distribute the servers across the DCs, assuming you have the same
amount of reads and writes across all the DCs.

Regards,
</VJ>




On Tue, Feb 9, 2010 at 11:16 AM, Stu Hood <st...@rackspace.com> wrote:

> The 'Ring management' and 'Range changes' sections of the wiki have gotten
> a lot better recently, and answer these questions. Specifically, look on
> that page for 'autobootstrap'.
>
> http://wiki.apache.org/cassandra/Operations#Ring_management
>
> Thanks,
> Stu
>
>
> -----Original Message-----
> From: "Robin Coe" <ro...@bluecoat.com>
> Sent: Tuesday, February 9, 2010 12:58pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: loadbalance and different strategies
>
> Is it true that it is no longer necessary to specify an initial token?
> If so, how would you add a new node into a ring such that it guarantees
> replicas are spread evenly across data centres?  Is this achieved simply
> by starting a new node in the opposite DC and watching the log for the
> message that it's receiving requests, before bootstrapping the next
> node?  Or is it possible to bootstrap multiple nodes simultaneously
> around the cluster and let Cassandra figure out the replica distribution
> pattern?
>
> I'm also curious about the distribution of keys across nodes.  The talk
> I've seen discusses how replicas are distributed around the cluster but
> since it's the number of keys on a node that really governs its load,
> assuming all keys are retrieved with equal frequency, does the load
> balancer also function to redistribute keys amongst the nodes?
>
> Robin.
>
> On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:
>
> > On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > > What they probably should do, is to just
> > > consider nodes in the DC they are booting to, and try to balance load
> > > evenly in that DC.
> >
> > I'm not sure what problem that would solve.  It seems to me there are two
> goals:
> >
> >  1. don't transfer data across data centers
> >  2. improve ring balance when you add nodes
> >
> > (1) should always be the case no matter where on the ring the node is
> > since there will be at least one replica of each range in each DC.
> >
> > (2) is where we get into trouble here no matter which DC we add to.
> >  (a) if we add to G's DC, X will get all the replicas G has, remaining
> > unbalanced
> >  (b) if we add to the other DC, G will still be hit from all the
> > replicas from the other DC
> >
> > So ISTM that the only real solution is to do what we say in the
> > Operations page, and make sure that nodes on the ring alternate DCs.
> > I don't think only considering nodes in the same DC helps with that.
> >
> > -Jonathan
>
>
>
>
>

Re: loadbalance and different strategies

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Feb 9, 2010 at 6:40 PM, Coe, Robin <ro...@bluecoat.com> wrote:
> Am I correct in assuming that a node given the flush command will not accept new writes

No.  It's not designed to do that.

RE: loadbalance and different strategies

Posted by "Coe, Robin" <ro...@bluecoat.com>.
I probably should have separated my questions; the question about 'nodeprobe flush' was based on emails I remember seeing some time ago, about clearing the commit log so that data could be moved across nodes.  I couldn't find any information about what effect that command has on a node's incoming read/write requests.

Am I correct in assuming that a node given the flush command will not accept new writes?  Does issuing a flush command effectively remove the node from the cluster, as far as writes go?  I expect reads could still be allowed to succeed even after the flush command is executed, but I was hoping for a better understanding of the behaviour of a node that is in a flush state.

Thanks,
Robin.

-----Original Message-----
From: Stu Hood [mailto:stu.hood@rackspace.com]
Sent: Tue 09/02/2010 12:24
To: cassandra-dev@incubator.apache.org
Cc: cassandra-dev@incubator.apache.org; cassandra-users@bluecoat.com
Subject: Re: loadbalance and different strategies
 
In 0.5, nodes can be automatically rebalanced one at a time using the 'nodetool loadbalance' command, mentioned on that page (although, admittedly, it is in the wrong section).

'nodetool flush' has nothing to do with key distribution: it is a local operation.

> Are there any side effects from taking down an existing cluster,
> changing the tokens and restarting
You would not want to do this, unless your replication factor was high enough that every node had a replica of every other node's data. Use the instructions on the Operations page instead.

Thanks,
Stu

-----Original Message-----
From: "Robin Coe" <ro...@bluecoat.com>
Sent: Tuesday, February 9, 2010 1:48pm
To: cassandra-dev@incubator.apache.org
Cc: cassandra-users@bluecoat.com
Subject: Re: loadbalance and different strategies

Thanks for the link, Stu.

So, from what I gather, initial tokens are required for seed nodes,
which then govern how keys are distributed across the cluster, implying
that the load balancer does not perform any key redistribution function.
Does the possibility for automatic key redistribution exist in the
architecture or does the md5 hashing of keys provide a decent
probability that keys will be evenly distributed?

Given the current implementation, let's say you determine that your keys
aren't evenly distributed, thus you want to change your tokens, instead
of adding a new node.  When you issue the nodeprobe flush command, does
that disable all incoming write requests for that node?  If so, are read
requests also turned away or will the node continue to service reads
until the process is killed?

Are there any side effects from taking down an existing cluster,
changing the tokens and restarting, other than the redistribution of
data that will occur?

Thanks,
Robin.

On Tue, 2010-02-09 at 13:16 -0600, Stu Hood wrote:

> The 'Ring management' and 'Range changes' sections of the wiki have gotten a lot better recently, and answer these questions. Specifically, look on that page for 'autobootstrap'.
> 
> http://wiki.apache.org/cassandra/Operations#Ring_management
> 
> Thanks,
> Stu
> 
> 
> -----Original Message-----
> From: "Robin Coe" <ro...@bluecoat.com>
> Sent: Tuesday, February 9, 2010 12:58pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: loadbalance and different strategies
> 
> Is it true that it is no longer necessary to specify an initial token?
> If so, how would you add a new node into a ring such that it guarantees
> replicas are spread evenly across data centres?  Is this achieved simply
> by starting a new node in the opposite DC and watching the log for the
> message that it's receiving requests, before bootstrapping the next
> node?  Or is it possible to bootstrap multiple nodes simultaneously
> around the cluster and let Cassandra figure out the replica distribution
> pattern?
> 
> I'm also curious about the distribution of keys across nodes.  The talk
> I've seen discusses how replicas are distributed around the cluster but
> since it's the number of keys on a node that really governs its load,
> assuming all keys are retrieved with equal frequency, does the load
> balancer also function to redistribute keys amongst the nodes? 
> 
> Robin.
> 
> On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:
> 
> > On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > > What they probably should do, is to just
> > > consider nodes in the DC they are booting to, and try to balance load
> > > evenly in that DC.
> > 
> > I'm not sure what problem that would solve.  It seems to me there are two goals:
> > 
> >  1. don't transfer data across data centers
> >  2. improve ring balance when you add nodes
> > 
> > (1) should always be the case no matter where on the ring the node is
> > since there will be at least one replica of each range in each DC.
> > 
> > (2) is where we get into trouble here no matter which DC we add to.
> >  (a) if we add to G's DC, X will get all the replicas G has, remaining
> > unbalanced
> >  (b) if we add to the other DC, G will still be hit from all the
> > replicas from the other DC
> > 
> > So ISTM that the only real solution is to do what we say in the
> > Operations page, and make sure that nodes on the ring alternate DCs.
> > I don't think only considering nodes in the same DC helps with that.
> > 
> > -Jonathan
> 
> 
> 
> 






Re: loadbalance and different strategies

Posted by Stu Hood <st...@rackspace.com>.
In 0.5, nodes can be automatically rebalanced one at a time using the 'nodetool loadbalance' command, mentioned on that page (although, admittedly, it is in the wrong section).

'nodetool flush' has nothing to do with key distribution: it is a local operation.

> Are there any side effects from taking down an existing cluster,
> changing the tokens and restarting
You would not want to do this, unless your replication factor was high enough that every node had a replica of every other node's data. Use the instructions on the Operations page instead.

Thanks,
Stu

-----Original Message-----
From: "Robin Coe" <ro...@bluecoat.com>
Sent: Tuesday, February 9, 2010 1:48pm
To: cassandra-dev@incubator.apache.org
Cc: cassandra-users@bluecoat.com
Subject: Re: loadbalance and different strategies

Thanks for the link, Stu.

So, from what I gather, initial tokens are required for seed nodes,
which then govern how keys are distributed across the cluster, implying
that the load balancer does not perform any key redistribution function.
Does the possibility for automatic key redistribution exist in the
architecture or does the md5 hashing of keys provide a decent
probability that keys will be evenly distributed?

Given the current implementation, let's say you determine that your keys
aren't evenly distributed, thus you want to change your tokens, instead
of adding a new node.  When you issue the nodeprobe flush command, does
that disable all incoming write requests for that node?  If so, are read
requests also turned away or will the node continue to service reads
until the process is killed?

Are there any side effects from taking down an existing cluster,
changing the tokens and restarting, other than the redistribution of
data that will occur?

Thanks,
Robin.

On Tue, 2010-02-09 at 13:16 -0600, Stu Hood wrote:

> The 'Ring management' and 'Range changes' sections of the wiki have gotten a lot better recently, and answer these questions. Specifically, look on that page for 'autobootstrap'.
> 
> http://wiki.apache.org/cassandra/Operations#Ring_management
> 
> Thanks,
> Stu
> 
> 
> -----Original Message-----
> From: "Robin Coe" <ro...@bluecoat.com>
> Sent: Tuesday, February 9, 2010 12:58pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: loadbalance and different strategies
> 
> Is it true that it is no longer necessary to specify an initial token?
> If so, how would you add a new node into a ring such that it guarantees
> replicas are spread evenly across data centres?  Is this achieved simply
> by starting a new node in the opposite DC and watching the log for the
> message that it's receiving requests, before bootstrapping the next
> node?  Or is it possible to bootstrap multiple nodes simultaneously
> around the cluster and let Cassandra figure out the replica distribution
> pattern?
> 
> I'm also curious about the distribution of keys across nodes.  The talk
> I've seen discusses how replicas are distributed around the cluster but
> since it's the number of keys on a node that really governs its load,
> assuming all keys are retrieved with equal frequency, does the load
> balancer also function to redistribute keys amongst the nodes? 
> 
> Robin.
> 
> On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:
> 
> > On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > > What they probably should do, is to just
> > > consider nodes in the DC they are booting to, and try to balance load
> > > evenly in that DC.
> > 
> > I'm not sure what problem that would solve.  It seems to me there are two goals:
> > 
> >  1. don't transfer data across data centers
> >  2. improve ring balance when you add nodes
> > 
> > (1) should always be the case no matter where on the ring the node is
> > since there will be at least one replica of each range in each DC.
> > 
> > (2) is where we get into trouble here no matter which DC we add to.
> >  (a) if we add to G's DC, X will get all the replicas G has, remaining
> > unbalanced
> >  (b) if we add to the other DC, G will still be hit from all the
> > replicas from the other DC
> > 
> > So ISTM that the only real solution is to do what we say in the
> > Operations page, and make sure that nodes on the ring alternate DCs.
> > I don't think only considering nodes in the same DC helps with that.
> > 
> > -Jonathan
> 
> 
> 
> 





Re: loadbalance and different strategies

Posted by Robin Coe <ro...@bluecoat.com>.
Thanks for the link, Stu.

So, from what I gather, initial tokens are required for seed nodes,
which then govern how keys are distributed across the cluster, implying
that the load balancer does not perform any key redistribution function.
Does the possibility for automatic key redistribution exist in the
architecture or does the md5 hashing of keys provide a decent
probability that keys will be evenly distributed?
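
For context, my mental model of the random partitioner is roughly the
following sketch (illustrative only, not the actual code): keys are
mapped onto the token ring via an MD5 hash, so even very skewed key
names spread out close to evenly.

import hashlib
from collections import Counter

RING_SIZE = 2 ** 127

def token_for(key):
    # map a key onto the ring via MD5 (simplified)
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

# bucket 100k sequential keys into 8 equal token ranges
counts = Counter(token_for('user:%d' % i) * 8 // RING_SIZE for i in range(100000))
print(sorted(counts.values()))   # all roughly 12500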

Given the current implementation, let's say you determine that your keys
aren't evenly distributed, thus you want to change your tokens, instead
of adding a new node.  When you issue the nodeprobe flush command, does
that disable all incoming write requests for that node?  If so, are read
requests also turned away or will the node continue to service reads
until the process is killed?

Are there any side effects from taking down an existing cluster,
changing the tokens and restarting, other than the redistribution of
data that will occur?

Thanks,
Robin.

On Tue, 2010-02-09 at 13:16 -0600, Stu Hood wrote:

> The 'Ring management' and 'Range changes' sections of the wiki have gotten a lot better recently, and answer these questions. Specifically, look on that page for 'autobootstrap'.
> 
> http://wiki.apache.org/cassandra/Operations#Ring_management
> 
> Thanks,
> Stu
> 
> 
> -----Original Message-----
> From: "Robin Coe" <ro...@bluecoat.com>
> Sent: Tuesday, February 9, 2010 12:58pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: loadbalance and different strategies
> 
> Is it true that it is no longer necessary to specify an initial token?
> If so, how would you add a new node into a ring such that it guarantees
> replicas are spread evenly across data centres?  Is this achieved simply
> by starting a new node in the opposite DC and watching the log for the
> message that it's receiving requests, before bootstrapping the next
> node?  Or is it possible to bootstrap multiple nodes simultaneously
> around the cluster and let Cassandra figure out the replica distribution
> pattern?
> 
> I'm also curious about the distribution of keys across nodes.  The talk
> I've seen discusses how replicas are distributed around the cluster but
> since it's the number of keys on a node that really governs its load,
> assuming all keys are retrieved with equal frequency, does the load
> balancer also function to redistribute keys amongst the nodes? 
> 
> Robin.
> 
> On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:
> 
> > On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > > What they probably should do, is to just
> > > consider nodes in the DC they are booting to, and try to balance load
> > > evenly in that DC.
> > 
> > I'm not sure what problem that would solve.  It seems to me there are two goals:
> > 
> >  1. don't transfer data across data centers
> >  2. improve ring balance when you add nodes
> > 
> > (1) should always be the case no matter where on the ring the node is
> > since there will be at least one replica of each range in each DC.
> > 
> > (2) is where we get into trouble here no matter which DC we add to.
> >  (a) if we add to G's DC, X will get all the replicas G has, remaining
> > unbalanced
> >  (b) if we add to the other DC, G will still be hit from all the
> > replicas from the other DC
> > 
> > So ISTM that the only real solution is to do what we say in the
> > Operations page, and make sure that nodes on the ring alternate DCs.
> > I don't think only considering nodes in the same DC helps with that.
> > 
> > -Jonathan
> 
> 
> 
> 



Re: loadbalance and different strategies

Posted by Stu Hood <st...@rackspace.com>.
The 'Ring management' and 'Range changes' sections of the wiki have gotten a lot better recently, and answer these questions. Specifically, look on that page for 'autobootstrap'.

http://wiki.apache.org/cassandra/Operations#Ring_management

Thanks,
Stu


-----Original Message-----
From: "Robin Coe" <ro...@bluecoat.com>
Sent: Tuesday, February 9, 2010 12:58pm
To: cassandra-dev@incubator.apache.org
Subject: Re: loadbalance and different strategies

Is it true that it is no longer necessary to specify an initial token?
If so, how would you add a new node into a ring such that it guarantees
replicas are spread evenly across data centres?  Is this achieved simply
by starting a new node in the opposite DC and watching the log for the
message that it's receiving requests, before bootstrapping the next
node?  Or is it possible to bootstrap multiple nodes simultaneously
around the cluster and let Cassandra figure out the replica distribution
pattern?

I'm also curious about the distribution of keys across nodes.  The talk
I've seen discusses how replicas are distributed around the cluster but
since it's the number of keys on a node that really governs its load,
assuming all keys are retrieved with equal frequency, does the load
balancer also function to redistribute keys amongst the nodes? 

Robin.

On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:

> On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > What they probably should do, is to just
> > consider nodes in the DC they are booting to, and try to balance load
> > evenly in that DC.
> 
> I'm not sure what problem that would solve.  It seems to me there are two goals:
> 
>  1. don't transfer data across data centers
>  2. improve ring balance when you add nodes
> 
> (1) should always be the case no matter where on the ring the node is
> since there will be at least one replica of each range in each DC.
> 
> (2) is where we get into trouble here no matter which DC we add to.
>  (a) if we add to G's DC, X will get all the replicas G has, remaining
> unbalanced
>  (b) if we add to the other DC, G will still be hit from all the
> replicas from the other DC
> 
> So ISTM that the only real solution is to do what we say in the
> Operations page, and make sure that nodes on the ring alternate DCs.
> I don't think only considering nodes in the same DC helps with that.
> 
> -Jonathan





Re: loadbalance and different strategies

Posted by Robin Coe <ro...@bluecoat.com>.
Is it true that it is no longer necessary to specify an initial token?
If so, how would you add a new node into a ring such that it guarantees
replicas are spread evenly across data centres?  Is this achieved simply
by starting a new node in the opposite DC and watching the log for the
message that it's receiving requests, before bootstrapping the next
node?  Or is it possible to bootstrap multiple nodes simultaneously
around the cluster and let Cassandra figure out the replica distribution
pattern?

I'm also curious about the distribution of keys across nodes.  The talk
I've seen discusses how replicas are distributed around the cluster but
since it's the number of keys on a node that really governs its load,
assuming all keys are retrieved with equal frequency, does the load
balancer also function to redistribute keys amongst the nodes? 

Robin.

On Tue, 2010-02-09 at 10:21 -0600, Jonathan Ellis wrote:

> On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> > What they probably should do, is to just
> > consider nodes in the DC they are booting to, and try to balance load
> > evenly in that DC.
> 
> I'm not sure what problem that would solve.  It seems to me there are two goals:
> 
>  1. don't transfer data across data centers
>  2. improve ring balance when you add nodes
> 
> (1) should always be the case no matter where on the ring the node is
> since there will be at least one replica of each range in each DC.
> 
> (2) is where we get into trouble here no matter which DC we add to.
>  (a) if we add to G's DC, X will get all the replicas G has, remaining
> unbalanced
>  (b) if we add to the other DC, G will still be hit from all the
> replicas from the other DC
> 
> So ISTM that the only real solution is to do what we say in the
> Operations page, and make sure that nodes on the ring alternate DCs.
> I don't think only considering nodes in the same DC helps with that.
> 
> -Jonathan



Re: loadbalance and different strategies

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Feb 9, 2010 at 3:13 AM, Jaakko <ro...@gmail.com> wrote:
> What they probably should do, is to just
> consider nodes in the DC they are booting to, and try to balance load
> evenly in that DC.

I'm not sure what problem that would solve.  It seems to me there are two goals:

 1. don't transfer data across data centers
 2. improve ring balance when you add nodes

(1) should always be the case no matter where on the ring the node is
since there will be at least one replica of each range in each DC.

(2) is where we get into trouble here no matter which DC we add to.
 (a) if we add to G's DC, X will get all the replicas G has, remaining
unbalanced
 (b) if we add to the other DC, G will still be hit from all the
replicas from the other DC

So ISTM that the only real solution is to do what we say in the
Operations page, and make sure that nodes on the ring alternate DCs.
I don't think only considering nodes in the same DC helps with that.

-Jonathan

Re: loadbalance and different strategies

Posted by Jaakko <ro...@gmail.com>.
>> it will take half of G's range. Problem is, it will take
>> half of G's _primary_ range, but most of G's load comes from
>> _replicas_.

> From looking at the code recently, it chooses a token that splits G's load by actually sampling the data stored on G, which should make the primary vs replica point moot.

Yeah, missed that change. The comments in getSplits do not reflect
that change :)

However, I think this does not remove the issue that a bootstrapping
node should consider datacenter (and possibly rack) placement. Let us
return to the original example and consider the difference depending
on whether X is from DC1 or DC2. Here's the original ring structure:

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G

If X is in the same DC1 as B and G, things are OK, as it will boot
in the middle of B and G. However, if X is added to DC2, it will also
boot in the middle of B and G. This will do nothing to balance the
load, so if there are multiple nodes bootstrapping, all of them will
go to the same middle region. What they probably should do is just
consider the nodes in the DC they are booting into, and try to
balance load evenly in that DC. If the other DC does the same, the
overall load should be well balanced.

-Jaakko

RE: loadbalance and different strategies

Posted by Stu Hood <st...@rackspace.com>.
> it will take half of G's range. Problem is, it will take
> half of G's _primary_ range, but most of G's load comes from
> _replicas_.
From looking at the code recently, it chooses a token that splits G's load by actually sampling the data stored on G, which should make the primary vs replica point moot.
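
A minimal sketch of that idea (illustrative, not the actual getSplits
or bootstrap code): sample the tokens of the keys the node is
actually storing and split at the median, so it does not matter
whether the load comes from the primary range or from replicas.

import hashlib

RING_SIZE = 2 ** 127

def token_for(key):
    # illustrative: derive a token from the MD5 of the key
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def split_token(sampled_keys):
    # median of the sampled tokens puts half of the node's data on each side
    tokens = sorted(token_for(k) for k in sampled_keys)
    return tokens[len(tokens) // 2]

print(split_token('row%d' % i for i in range(10000)))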

-----Original Message-----
From: "Jaakko" <ro...@gmail.com>
Sent: Tuesday, February 9, 2010 12:51am
To: cassandra-dev@incubator.apache.org
Subject: loadbalance and different strategies

Hi,

The current implementation of loadbalance seems to work only for
RackUnaware. The problem is that only RackUnaware creates a "flat"
replica space where all nodes are equal. This is not true for the
other strategies: due to rack and DC considerations, replicas are not
evenly distributed among nodes. To illustrate the problem, consider
the following scenario:

- a cluster with nodes A through H. Nodes B and G are in DC1, the
rest of the nodes are in DC2.
- DC shard strategy, replication factor 3 (without loss of generality
we can omit rack considerations from this discussion).

In this situation the ranges would be (node: primary range, followed
by replica ranges):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
G: F-G, A-B, B-C, C-D, D-E, E-F
H: G-H, E-F, F-G

Now, in this situation node G is most likely by far the most loaded
one, so if a node bootstraps (either a new one, or a loadbalance
operation), it will take half of G's range. The problem is that it
will take half of G's _primary_ range, but most of G's load comes
from _replicas_. After this operation, the ring would be (X denotes
the new node):

A: H-A, F-G
B: A-B, H-A
C: B-C, H-A, A-B
D: C-D, B-C
E: D-E, C-D
F: E-F, D-E
X: F-X, A-B, B-C, C-D, D-E, E-F
G: X-G, F-X
H: G-H, E-F, F-X, X-G

It is clear from this that the situation has not really improved.
The only difference is that X is now the most loaded node and G
probably has a very light load. If another new node arrives, it will
again go in front of X, and the situation will remain largely the
same.

In order to get rid of such replica sinks, nodes would also need to
consider replica "categories". When we're looking for replica
destinations, we essentially consider the categories "in another DC",
"in another rack" and "anywhere". When node X boots in the ring
above, it should not just consider G's primary range, but G's
effective range (primary plus replicas). The number of replicas a
node holds is determined largely by the nodes that belong to the same
replica category. If X belongs to DC1 (same as B and G), the best
balance would be gained if X booted in the middle of B and G, as that
would divide the replicas evenly. This might not always be the best
place, because individual ranges might differ considerably in size.

To fix this properly, the ideal solution would be to modify the load
string so that, instead of the total load, it reports both the load
from the primary range and the load from replicas. This would allow
the bootstrapping node to decide whether it should take half of the
replicas or half of the primary range in order to get the optimal
result. However, there is currently no way to get these two numbers,
so we only have the total load number. It would not be perfect, but
perhaps for now it would be best to only consider nodes from the same
category when making load balancing decisions. That is, for rack
unaware we would consider all nodes as we do now, but for other
strategies we would determine the bootstrap token based on which
nodes are in the same category. I don't know if this would work, but
it should at least be better than what we have now.

Another related issue: now that we have a strategy per table, how
should we approach load balancing? The optimal decision for one
strategy might be bad for another. If we have just one strategy in
use, that's clear, but with multiple strategies we would need to
determine which one to favor.

Or am I thinking about this in a completely wrong way?

-Jaakko