
Posted to user@cassandra.apache.org by Mark Jones <MJ...@imagehawk.com> on 2010/04/07 15:48:13 UTC

What is loadbalance supposed to do? 0.6.0RC1

It shouldn't remove a node from the ring, should it?  (It appears it did.)
It shouldn't remove data from the db, should it?  (Data size appears to grow, but records are now missing.)

Loaded 38 million "rows" and the ring looked like this:

  mark@ec2:~/cassandra/apache-cassandra-0.6.0-rc1$ bin/nodetool --host 192.168.1.116 ring
  Address       Status     Load          Range                                      Ring
                                         167730615856220406399741259265091647472
  192.168.1.116 Up         4.81 GB       54880762918591020775962843965839761529     |<--|
  192.168.1.119 Up         12.96 GB      160455137948102479104219052453775170160    |   |
  192.168.1.12  Up         8.98 GB       167730615856220406399741259265091647472    |--

So I did this:
  mark@record:~/cassandra/apache-cassandra-0.6.0-rc1$ bin/nodetool --host 192.168.1.12 loadbalance

And this happened (even though Cassandra was still running):

  mark@record:~/cassandra/apache-cassandra-0.6.0-rc1$ bin/nodetool --host 192.168.1.12 ring
  Address       Status     Load          Range                                      Ring
                                         160455137948102479104219052453775170160
  192.168.1.116 Up         12.71 GB      54880762918591020775962843965839761529     |<--|
  192.168.1.119 Up         13.47 GB      160455137948102479104219052453775170160    |-->|

After restarting Cassandra on .12

  mark@record:~/cassandra/apache-cassandra-0.6.0-rc1$ bin/nodetool --host 192.168.1.12 ring
  Address       Status     Load          Range                                      Ring
                                         160455137948102479104219052453775170160
  192.168.1.116 Up         12.71 GB      54880762918591020775962843965839761529     |<--|
  192.168.1.12  Up         8.98 GB       107669873051407416105654071439122680093    |   |
  192.168.1.119 Up         13.47 GB      160455137948102479104219052453775170160    |-->|

Now I have more data, but nearly 50% of my queries are failing (not found).  This data was checked before the load balance was done.

Re: What is loadbalance supposed to do? 0.6.0RC1

Posted by Rob Coli <rc...@digg.com>.

On 4/7/10 7:39 AM, Mark Jones wrote:
> Also, if the data is pushed out to the other nodes before the bootstrapping, why has data been lost?  Does this mean that decommissioning a node results in data loss?

As I understand it, in the following scenario:

1) Node A has Keys 0-10.

2) Add Node B as a bootstrapping node; Node A is loadbalanced and sheds
keys 5-10 to Node B.

Keys 5-10 are not actually removed from the SSTables on Node A until a 
"cleanup compaction" is run. A "cleanup compaction" is a "major 
compaction" which also checks to see whether keys still "belong" on this 
host.

I don't know whether you have actually experienced data loss, but based 
on the above, it should not be possible for you to have.
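
To make the scenario above concrete, here is a minimal toy sketch in Python. It is not Cassandra code: the dictionaries standing in for SSTables and the helper names `shed` and `cleanup` are hypothetical, invented only for illustration. It shows that shedding copies keys 5-10 to Node B while Node A keeps its own copies until a cleanup pass drops the keys it no longer owns.

```python
# Toy model only: each node's on-disk data is a plain dict of key -> row.
# The helper names below are hypothetical and are not part of Cassandra.

def shed(source, target, keys):
    """Copy the given keys from source to target; source keeps its copies."""
    for key in keys:
        target[key] = source[key]

def cleanup(node, owned_keys):
    """Return only the rows whose keys this node still owns."""
    return {key: row for key, row in node.items() if key in owned_keys}

# 1) Node A has keys 0-10.
node_a = {key: f"row-{key}" for key in range(0, 11)}
node_b = {}

# 2) Node B bootstraps and Node A sheds keys 5-10 to it.
shed(node_a, node_b, range(5, 11))
keys_on_a_before_cleanup = sorted(node_a)

# Only a cleanup pass removes the keys that no longer belong on Node A.
node_a = cleanup(node_a, set(range(0, 5)))

print("A before cleanup:", keys_on_a_before_cleanup)
print("A after cleanup: ", sorted(node_a))
print("B:               ", sorted(node_b))
```

In this model every key exists on at least one node at every step, which is the reasoning behind the statement that data loss should not be possible.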

=Rob

RE: What is loadbalance supposed to do? 0.6.0RC1

Posted by Mark Jones <MJ...@imagehawk.com>.

The log said Bootstrapping @ 07:34.  Since it was 08:35, I assumed it wasn't doing anything; also, CPU usage was < 10%.

Turns out, when I restarted the node, it claimed the time was 7:35 rather than 8:35.  Why would log4j be off by one hour?  We are on CDT here, and have been for more than a week.  The date command returns the appropriate time (Wed Apr  7 09:24:50 CDT 2010), I see no evidence of a TZ variable, and /etc/timezone shows "America/Chicago".

If it were off by 6 hours instead of 1, I could understand this, but it's only off by one hour.

System.getProperties() reports the timezone as blank.
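
One possible explanation, offered purely as an assumption and not confirmed anywhere in this thread, is that the logger is formatting timestamps at a fixed standard-time offset (CST, UTC-6) without applying daylight saving (CDT, UTC-5). The Python sketch below only checks the arithmetic of that hypothesis using fixed offsets; it does not inspect a JVM or log4j.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets chosen for illustration; this is an assumption about the cause.
CDT = timezone(timedelta(hours=-5), "CDT")  # Central Daylight Time
CST = timezone(timedelta(hours=-6), "CST")  # Central Standard Time (no DST applied)

# The restart time reported in the message above, as shown by the system clock.
wall_clock = datetime(2010, 4, 7, 8, 35, tzinfo=CDT)

# The same instant rendered without DST, and in UTC for comparison.
as_cst = wall_clock.astimezone(CST)
as_utc = wall_clock.astimezone(timezone.utc)

print("system clock (CDT):", wall_clock.strftime("%H:%M"))
print("without DST (CST): ", as_cst.strftime("%H:%M"))
print("UTC:               ", as_utc.strftime("%H:%M"))
```

Under this assumption the log would read 07:35 for an 08:35 CDT restart, which is a one-hour difference. A logger running in UTC would instead read 13:35.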

Also, if the data is pushed out to the other nodes before the bootstrapping, why has data been lost?  Does this mean that decommissioning a node results in data loss?



-----Original Message-----
From: Sylvain Lebresne [mailto:sylvain@yakaz.com]
Sent: Wednesday, April 07, 2010 9:07 AM
To: user@cassandra.apache.org
Subject: Re: What is loadbalance supposed to do? 0.6.0RC1

> It shouldn't remove a node from the ring should it?  (appears it did)

It does. As explained here: http://wiki.apache.org/cassandra/Operations,
loadbalance decommissions the node and then adds it back as a bootstrapping
node (roughly).

So the node disappearing is expected, and it is supposed to come back.
But this is not a quick operation (and certainly not one you want to do every
other day). You apparently restarted Cassandra while it was doing its stuff.

Not sure the loss of data is to be expected though.


Re: What is loadbalance supposed to do? 0.6.0RC1

Posted by Sylvain Lebresne <sy...@yakaz.com>.

> It shouldn't remove a node from the ring should it?  (appears it did)

It does. As explained here: http://wiki.apache.org/cassandra/Operations,
loadbalance decommissions the node and then adds it back as a bootstrapping
node (roughly).

So the node disappearing is expected, and it is supposed to come back.
But this is not a quick operation (and certainly not one you want to do every
other day). You apparently restarted Cassandra while it was doing its stuff.

Not sure the loss of data is to be expected though.
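
As a rough illustration of the decommission-then-bootstrap sequence described above, the Python sketch below models the ring using the tokens from the nodetool output in the original message. It is a simplified model, not Cassandra code: it assumes each key belongs to the first node whose token is greater than or equal to the key's token (wrapping around the ring), it ignores replication entirely, and the `owner` helper is hypothetical.

```python
from bisect import bisect_left

# Tokens copied from the nodetool ring output in the original message.
T116 = 54880762918591020775962843965839761529
T119 = 160455137948102479104219052453775170160
T12_OLD = 167730615856220406399741259265091647472
T12_NEW = 107669873051407416105654071439122680093

def owner(ring, key_token):
    """Return the node whose token is the first one >= key_token, wrapping."""
    tokens = sorted(ring)
    index = bisect_left(tokens, key_token)
    return ring[tokens[index % len(tokens)]]

# Ring membership at the three points shown in the thread.
before = {T116: "192.168.1.116", T119: "192.168.1.119", T12_OLD: "192.168.1.12"}
during = {T116: "192.168.1.116", T119: "192.168.1.119"}
after = {T116: "192.168.1.116", T12_NEW: "192.168.1.12", T119: "192.168.1.119"}

# One sample token from .12's old range and one from its new range.
old_range_key = T12_OLD - 1
new_range_key = T12_NEW - 1

for label, ring in (("before", before), ("during", during), ("after", after)):
    print(label, owner(ring, old_range_key), owner(ring, new_range_key))
```

In this model, keys from .12's old range move to .116 once .12 leaves the ring, which is consistent with .116 growing from 4.81 GB to 12.71 GB. After the restart, .12 instead owns part of the range that previously belonged to .119.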
