Posted to user@cassandra.apache.org by Rodrigo Felix <ro...@gmail.com> on 2013/07/06 22:50:25 UTC

General doubts about bootstrap

Hi,

   I'm facing some problems and would be grateful if you could help with
some of them.
   *Environment:* 2 seeds and 2 other nodes, all installed on m1.large EC2
instances. Each seed starts with about 1.7GB of data. Default cassandra
configuration.

   - Is it normal to take about 9 minutes to add a new node? The log
   generated by the script I use to add a node follows.

[06/07/2013 20:07:53] Remove all data stored in the Cassandra node
[06/07/2013 20:07:54] [OK] All data successfully removed
[06/07/2013 20:07:54] Setting seeds on cassandra.yml
[06/07/2013 20:07:54] [OK] seeds successfully set
[06/07/2013 20:07:54] Setting listen_address on cassandra.yml
[06/07/2013 20:07:54] [OK] listen_address successfully set
[06/07/2013 20:07:54] Setting initial_token on cassandra.yml
[06/07/2013 20:07:54] [OK] initial_token successfully set
*[06/07/2013 20:07:54] Starting cassandra...*
*[06/07/2013 20:16:36] [OK] Cassandra started*
[06/07/2013 20:16:37] Changing token of i-5cfc082f
[06/07/2013 20:18:00] [OK] Token of i-5cfc082f successfully set to
56713727820156410577229101238628035242
[06/07/2013 20:18:00] Cleaning up i-5cfc082f
[06/07/2013 20:20:13] Clean up of i-5cfc082f successfully finished
[06/07/2013 20:20:13] Machine added

   - Is there a way to reduce the time to start cassandra?
   - Sometimes the cleanup operation takes many minutes (about 10). Is this
   normal, given that the amount of data is small (at most 1.7 GB per seed)?
   - Considering that I have two seeds in the beginning, their tokens are 0
   and 85070591730234615865843651857942052864. When I add a new machine, do I
   need to execute move and cleanup on both seeds? Currently, I'm running
   cleanup on seed 0, move + cleanup on the other seed, and neither move nor
   cleanup on the newly added node. Is this OK?
   - What if I do not run cleanup on any existing node when adding or
   removing a node? Is the data that was not "cleaned up" still served if I
   send a scan, for instance, whose range is still stored on the node but
   would have been deleted by cleanup? Or would the data be gathered from
   another node, i.e. the one that properly owns the range specified in the
   scan query?
   - After decommissioning a node, is it advisable to run cleanup on the
   remaining nodes? Are the consequences of not running it the same as when
   adding a node?
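
As a rough illustration of the token arithmetic behind these questions, the sketch below computes evenly spaced tokens. It assumes RandomPartitioner with a token space of 2**127, which is consistent with the tokens quoted above but is not stated in the post; `balanced_tokens` is a hypothetical helper name, not a Cassandra API.

```python
# Evenly spaced initial_token values, assuming RandomPartitioner and a
# token space of [0, 2**127). Both are assumptions for illustration.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return node_count tokens that split the ring into equal ranges."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


for count in (2, 3):
    print(count, balanced_tokens(count))
```

Under these assumptions, the two-node result reproduces the seed tokens quoted above, and the second token of the three-node result is the value that appears in the script log.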

   Thank you very much in advance.

Regards,

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP

Re: General doubts about bootstrap

Posted by Rodrigo Felix <ro...@gmail.com>.
Currently, I'm using Cassandra 1.1.5, but I'm considering upgrading to
1.2.x in order to make use of vnodes.
Doubling the cluster size is not possible for me because I want to measure
the response while adding (or removing) single nodes.
Thank you, guys. It helped me a lot to better understand how Cassandra works.

Regards,

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP


On Wed, Jul 10, 2013 at 11:11 AM, Eric Stevens <mi...@gmail.com> wrote:

> > => Adding a new node between other nodes would avoid running move, but
> > the ring would be unbalanced, right? Would this imply having one node
> > overloaded (with a bigger range: 1/2 of the ring, while the other 2
> > nodes have 1/4 each, supposing 3 nodes)? I'm referring to
> > http://wiki.apache.org/cassandra/Operations#Load_balancing
>
> Yes, if you're using a single vnode (one token) per server, or are running
> a version of Cassandra older than 1.2.  For lowest impact, doubling the
> size of your cluster is recommended so that you can avoid doing moves.  Or
> if you're on Cassandra 1.2+, you can use vnodes, and you should not
> typically need to rebalance after bringing a new server online.

Re: General doubts about bootstrap

Posted by Eric Stevens <mi...@gmail.com>.
On Tue, Jul 9, 2013 at 9:31 PM, Rodrigo Felix <
rodrigofelixdealmeida@gmail.com> wrote:

> => Adding a new node between other nodes would avoid running move, but
> the ring would be unbalanced, right? Would this imply having one node
> overloaded (with a bigger range: 1/2 of the ring, while the other 2
> nodes have 1/4 each, supposing 3 nodes)? I'm referring to
> http://wiki.apache.org/cassandra/Operations#Load_balancing

Yes, if you're using a single vnode (one token) per server, or are running
a version of Cassandra older than 1.2.  For lowest impact, doubling the
size of your cluster is recommended so that you can avoid doing moves.  Or
if you're on Cassandra 1.2+, you can use vnodes, and you should not
typically need to rebalance after bringing a new server online.
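
The point about doubling the cluster can be checked with a short sketch. It assumes RandomPartitioner with a token space of 2**127 and one token per node; `balanced_tokens` and `tokens_to_move` are hypothetical helper names used only for illustration.

```python
RING_SIZE = 2 ** 127  # assumed RandomPartitioner token space


def balanced_tokens(node_count):
    """Return node_count tokens that split the ring into equal ranges."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def tokens_to_move(old_count, new_count):
    """Existing balanced tokens that are not balanced positions in the new ring.

    Each returned token belongs to a node that would need a move to keep
    the grown ring balanced.
    """
    new_positions = set(balanced_tokens(new_count))
    return [t for t in balanced_tokens(old_count) if t not in new_positions]


moves_when_doubling = tokens_to_move(2, 4)
moves_when_adding_one = tokens_to_move(2, 3)
```

In this sketch, growing from 2 to 4 nodes leaves every existing token on a balanced position, so no move is needed, while growing from 2 to 3 leaves one existing token off-balance.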

Re: General doubts about bootstrap

Posted by Rodrigo Felix <ro...@gmail.com>.
Thank you very much for your response. My comments on your email follow.

Regards,

*Rodrigo Felix de Almeida*
LSBD - Universidade Federal do Ceará
Project Manager
MBA, CSM, CSPO, SCJP


On Mon, Jul 8, 2013 at 6:05 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Sat, Jul 6, 2013 at 1:50 PM, Rodrigo Felix <
> rodrigofelixdealmeida@gmail.com> wrote:
>
>>    - Is it normal to take about 9 minutes to add a new node? The log
>>    generated by the script I use to add a node follows.
>
> Sure.

=> OK

>>    - Is there a way to reduce the time to start cassandra?
>
> Not usually.

=> OK

>>    - Sometimes the cleanup operation takes many minutes (about 10). Is this
>>    normal, given that the amount of data is small (at most 1.7 GB per seed)?
>
> Compaction is throttled, and cleanup is a type of compaction. Bootstrap
> is also throttled via the streaming throttle.

=> OK

>>    - Considering that I have two seeds in the beginning, their tokens
>>    are 0 and 85070591730234615865843651857942052864. When I add a new machine,
>>    do I need to execute move and cleanup on both seeds? Currently, I'm running
>>    cleanup on seed 0, move + cleanup on the other seed, and neither move nor
>>    cleanup on the newly added node. Is this OK?
>
> Only nodes which have "lost" ranges need to run cleanup. In general you
> should add new nodes "between" other nodes such that "move" is not required
> at all.

=> Adding a new node between other nodes would avoid running move, but the
ring would be unbalanced, right? Would this imply having one node overloaded
(with a bigger range: 1/2 of the ring, while the other 2 nodes have 1/4
each, supposing 3 nodes)? I'm referring to
http://wiki.apache.org/cassandra/Operations#Load_balancing
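
The imbalance described in this question can be worked out with a small sketch. It assumes RandomPartitioner with a token space of 2**127, one token per node, and a new node placed at the midpoint between the two existing tokens; `ownership` is a hypothetical helper, not a Cassandra API.

```python
from fractions import Fraction

RING_SIZE = 2 ** 127  # assumed RandomPartitioner token space


def ownership(tokens):
    """Map each token to the fraction of the ring its node owns.

    A node owns the range (previous_token, its_token], wrapping around.
    """
    ring = sorted(tokens)
    if len(ring) == 1:
        return {ring[0]: Fraction(1)}
    shares = {}
    for i, token in enumerate(ring):
        previous = ring[i - 1]  # index -1 wraps to the last token
        shares[token] = Fraction((token - previous) % RING_SIZE, RING_SIZE)
    return shares


# Two seeds at 0 and 2**126, plus a new node inserted halfway between them.
shares = ownership([0, 2 ** 126, 2 ** 125])
for token, share in sorted(shares.items()):
    print(token, share)
```

With these assumed tokens, the node at token 0 keeps half of the ring and the other two nodes end up with a quarter each, which is the imbalance the question asks about.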

>>    - What if I do not run cleanup on any existing node when adding or
>>    removing a node? Is the data that was not "cleaned up" still served if I
>>    send a scan, for instance, whose range is still stored on the node but
>>    would have been deleted by cleanup? Or would the data be gathered from
>>    another node, i.e. the one that properly owns the range specified in the
>>    scan query?
>
> If data for range [x] is on node [a] but node [a] is no longer considered
> an endpoint for range [x], it will never receive a request to serve range
> [x].

=> OK

>>    - After decommissioning a node, is it advisable to run cleanup on the
>>    remaining nodes? Are the consequences of not running it the same as when
>>    adding a node?
>
> Cleanup is only for the node which lost a range. In the decommission case,
> no live nodes lost a range, only some nodes gained one.

=> OK

> =Rob

Re: General doubts about bootstrap

Posted by Robert Coli <rc...@eventbrite.com>.
On Sat, Jul 6, 2013 at 1:50 PM, Rodrigo Felix <
rodrigofelixdealmeida@gmail.com> wrote:

>    - Is it normal to take about 9 minutes to add a new node? The log
>    generated by the script I use to add a node follows.

Sure.

>    - Is there a way to reduce the time to start cassandra?

Not usually.

>    - Sometimes the cleanup operation takes many minutes (about 10). Is this
>    normal, given that the amount of data is small (at most 1.7 GB per seed)?

Compaction is throttled, and cleanup is a type of compaction. Bootstrap is
also throttled via the streaming throttle.
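
To give a feel for what a throttle implies, the sketch below computes a lower bound on processing time under a throughput cap. The caps used (16 MB/s for compaction, 400 megabits/s for streaming) are assumed values for the `compaction_throughput_mb_per_sec` and `stream_throughput_outbound_megabits_per_sec` settings, not figures from this thread; check the cassandra.yaml of the version in use. The helper names are hypothetical.

```python
def megabits_to_megabytes(megabits_per_sec):
    """Convert a rate in megabits per second to megabytes per second."""
    return megabits_per_sec / 8


def min_seconds(data_mb, throughput_mb_per_sec):
    """Lower bound on the time needed to process data_mb under a cap."""
    if throughput_mb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return data_mb / throughput_mb_per_sec


data_mb = 1.7 * 1024            # about 1.7 GB, as in the original post
compaction_cap = 16             # MB/s, assumed setting
streaming_cap = megabits_to_megabytes(400)  # MB/s, assumed setting

print("cleanup floor (s):", min_seconds(data_mb, compaction_cap))
print("streaming floor (s):", min_seconds(data_mb, streaming_cap))
```

These figures are only floors: the throttle limits the maximum rate, and the actual duration also depends on disk, CPU, and other activity on the node.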

>    - Considering that I have two seeds in the beginning, their tokens are
>    0 and 85070591730234615865843651857942052864. When I add a new machine, do
>    I need to execute move and cleanup on both seeds? Currently, I'm running
>    cleanup on seed 0, move + cleanup on the other seed, and neither move nor
>    cleanup on the newly added node. Is this OK?

Only nodes which have "lost" ranges need to run cleanup. In general you
should add new nodes "between" other nodes such that "move" is not required
at all.

>    - What if I do not run cleanup on any existing node when adding or
>    removing a node? Is the data that was not "cleaned up" still served if I
>    send a scan, for instance, whose range is still stored on the node but
>    would have been deleted by cleanup? Or would the data be gathered from
>    another node, i.e. the one that properly owns the range specified in the
>    scan query?

If data for range [x] is on node [a] but node [a] is no longer considered
an endpoint for range [x], it will never receive a request to serve range
[x].
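
A simplified model of this routing rule is sketched below. It assumes one token per node and a replication factor of 1, and it ignores replication strategies and snitches; `primary_endpoint` and the node names are hypothetical.

```python
import bisect


def primary_endpoint(ring, key_token):
    """Return the node that owns key_token.

    `ring` maps token -> node name. The owner is the node with the
    smallest token greater than or equal to key_token, wrapping around
    to the lowest token when key_token is past the highest one.
    """
    tokens = sorted(ring)
    index = bisect.bisect_left(tokens, key_token)
    return ring[tokens[index % len(tokens)]]


key_token = 2 ** 124

before = {0: "a", 2 ** 126: "b"}
after = {0: "a", 2 ** 125: "c", 2 ** 126: "b"}  # node "c" bootstrapped

owner_before = primary_endpoint(before, key_token)
owner_after = primary_endpoint(after, key_token)
```

In this model, node "b" still holds its old copy of the data until cleanup runs, but lookups for that token are routed to "c" once it has joined, so the stale copy is never consulted.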

>    - After decommissioning a node, is it advisable to run cleanup on the
>    remaining nodes? Are the consequences of not running it the same as when
>    adding a node?

Cleanup is only for the node which lost a range. In the decommission case,
no live nodes lost a range, only some nodes gained one.
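
The difference between adding and decommissioning a node can be sketched as follows. The model assumes one token per node, a replication factor of 1, and RandomPartitioner with a token space of 2**127; it covers only nodes joining or leaving, not `move`. The helper names are hypothetical.

```python
RING_SIZE = 2 ** 127  # assumed RandomPartitioner token space


def range_widths(ring):
    """Map each node name to the width of its primary range.

    `ring` maps token -> node name; a node's primary range is
    (previous_token, its_token], wrapping around the ring.
    """
    tokens = sorted(ring)
    widths = {}
    for i, token in enumerate(tokens):
        width = (token - tokens[i - 1]) % RING_SIZE
        widths[ring[token]] = width or RING_SIZE  # a lone node owns everything
    return widths


def nodes_that_lost_range(before, after):
    """Nodes present in both rings whose primary range shrank."""
    old, new = range_widths(before), range_widths(after)
    return sorted(node for node in new if node in old and new[node] < old[node])


two_nodes = {0: "a", 2 ** 126: "b"}
three_nodes = {0: "a", 2 ** 125: "c", 2 ** 126: "b"}

lost_on_add = nodes_that_lost_range(two_nodes, three_nodes)
lost_on_decommission = nodes_that_lost_range(three_nodes, two_nodes)
```

In this model, adding "c" shrinks only the range of "b", which is therefore the only cleanup candidate, while decommissioning "c" shrinks no surviving node's range.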

=Rob