Posted to user@cassandra.apache.org by 土卜皿 <pe...@gmail.com> on 2016/01/27 14:11:57 UTC

why is one of the newly added nodes' bootstrap so slow?

Hi

Cassandra version: 2.1.11

The existing cluster has three nodes:

[root@report-02 cassandra]# bin/nodetool status
UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1

I wanted to add two nodes and set the same configs as the cluster's nodes.

node1: 192.21.0.184
node2: 192.21.0.185

After starting the two nodes one by one, the first node 192.21.0.184 finished
joining immediately, but the second one, 192.21.0.185, has taken more than 24
hours to join and has still not finished:

Under 192.21.0.184:

[root@report-01 cassandra]# bin/nodetool compactionstats
pending tasks: 0

Under 192.21.0.185:

 [root@report-02 cassandra]# bin/nodetool compactionstats
 pending tasks: 21
 compaction type      keyspace       table     completed          total    unit   progress
 Compaction   testforuser   users1027    6204396079    14923537640   bytes     41.57%
 Compaction   user_center       users   19325435997   514143044706   bytes      3.76%
 Compaction   user_center       users   12305639479   118703090319   bytes     10.37%
 Active compaction remaining time :  10h05m54s

And:

[root@report-02 cassandra]# bin/nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1

From the above load sizes, node2 (192.21.0.185)'s 299.22 GB is obviously
not normal.

And node2's bootstrap was interrupted several times because it got an error:

INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.135 is complete
INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.131 is complete
WARN  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Stream failed
ERROR 00:57:42 Exception encountered during startup
java.lang.RuntimeException: Error during boostrap: Stream failed

So I restarted it and the join continued!

I don't know why there is such a difference between the two nodes.

Should I stop it and change something?

Thank you in advance!

Dillon

Re: why is one of the newly added nodes' bootstrap so slow?

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Dillon


> What should I do about this wrong bootstrap?


You should first remove the .184 node (the node with almost no data). The
standard command is *nodetool decommission*, run from the node you want to
remove from the cluster. Yet this would move the data from the node we want
to remove to the other nodes, and we don't trust the data on these 2 nodes.
Instead you can stop the node and use *nodetool removenode <HostID>* from
another node to remove it, using the other nodes to create the new replicas.
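
For example (a minimal sketch: the Host ID below is .184's, taken from your
nodetool status output, and the service command assumes a package install,
so adjust it to however you run Cassandra):

# on 192.21.0.184: stop the Cassandra process first
sudo service cassandra stop

# then, from any of the 3 healthy nodes, drop it from the ring by Host ID
nodetool removenode 4041c232-c110-4315-89a1-23ca53b851c2

# check progress (or use 'nodetool removenode force' if it hangs)
nodetool removenode status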

Then, when your node is down and out of the cluster, any of the following
should work.

sudo rm -rf /var/lib/cassandra/data/*
sudo rm -rf /var/lib/cassandra/commitlog/*
sudo rm -rf /var/lib/cassandra/saved_caches/*
sudo rm -rf /var/log/cassandra/*

or just

sudo rm -rf /var/lib/cassandra/*

Set auto_bootstrap: true in cassandra.yaml and start Cassandra when you're
ready to bootstrap.
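
Note that auto_bootstrap is not listed in the default cassandra.yaml and
already defaults to true, so this only matters if you set it to false at
some point; the relevant line would be:

auto_bootstrap: true

Then start the node and watch the streaming with *nodetool netstats* on it.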

> Yes, the old nodes with Memory: 64G, Disk: 4 X 1.1T and CPU: 16 cores, the
> old nodes with: Memory 32G, Disk: 1 X 460G and CPU: 32 cores


Not sure which ones are the new and which are the old ones, but some have
bigger CPU / memory and others have more space, so it is hard to say how you
should configure vnodes; the balance of power / space depends on your use
case. In general I *wouldn't* advise using heterogeneous hardware (even if
Cassandra allows it), to avoid operational overhead. It is way easier to
treat every node the same way imho.

> No particular reason :-), the 512 is from someone's example, and 256 because
> I used different hardware. Can I modify all the numbers after I add these
> new nodes successfully?


About the number of vnodes, once the nodes are in it is too late. I heard
(and did not check) that the default number of vnodes is too high for most
cases, impacting the performance of repairs for example. So I wanted to let
you know. Maybe someone else will be able to tell you more about it.

Under "192.21.0.185 229.2GB", I can directly "rm -rf
> /path_to_cassandra/data/" without changing anything else, and start the
> cassadra again?


I am not sure what's wrong with this node. I would probably dig into it a
bit more before removing it. Did you try repairs / cleanup on your nodes? Is
there any error in your logs? Do you have snapshots taking up space? What is
the output of *nodetool status <mykeyspace>*?
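
For example, something along these lines (a sketch assuming the default
package paths, adjust them to your install):

# any streaming / compaction errors around the join?
grep -iE 'ERROR|Stream failed' /var/log/cassandra/system.log

# disk space eaten by snapshots, if any
du -sh /var/lib/cassandra/data/*/*/snapshots 2>/dev/null

# streaming state on the joining node
nodetool netstats

# per-keyspace view, so the Owns column is actually computed
nodetool status <mykeyspace>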

Yet the decision is yours; I have no idea whether this is a production
cluster or not, nor about the environment / the things you did, etc.

Good luck

C*heers,

-----------------
Alain Rodriguez
France

The Last Pickle
http://www.thelastpickle.com


2016-01-28 2:57 GMT+01:00 土卜皿 <pe...@gmail.com>:

> Hi Alain,
> Thank you very much!
>
>
>>> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
>>> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>>>
>>>
>> Obviously .184 didn't bootstrap correctly. When a node is added, it
>> becomes responsible for a range (multiple ranges with vnodes), so it has to
>> receive data from nodes previously responsible for this (these) range(s).
>> So 600 MB looks wrong.
>>
>
> What should I do about this wrong bootstrap?
>
>
>>
>> So .185 is behaving as expected, .184 isn't.
>>
>> Yet .185 having twice the data of the other nodes is weird, unless you
>> changed the Replication factor or streamed data multiple times (then
>> compaction will eventually fix this).
>>
>
> No, I did not change Replication factor
>
>
>> Plus this node has fewer tokens than the first 3 nodes.
>> Are you running heterogeneous hardware?
>>
>
> Yes, the old nodes with Memory: 64G, Disk: 4 X 1.1T and CPU: 16 cores, the
> old nodes with: Memory 32G, Disk: 1 X 460G and CPU: 32 cores
>
>
>
>> Why set 512 tokens for the first 3 nodes, and 256 for the others? From
>> what I heard the default vnode count is way too high; you generally want
>> to go with something between 16 and 64 in production (if it is not too late).
>>
>
> No particular reason :-), the 512 is from someone's example, and 256 because
> I used different hardware. Can I modify all the numbers after I add these
> new nodes successfully?
>
>
>>> So I restarted it and the join continued! I don't know why there is such a
>>> difference between the two nodes?
>>>
>> My guess is the join did not continue. Once you bootstrap a node, the
>> system keyspace is filled with some information. If the bootstrap fails,
>> you need to wipe the data directory. I advise you to directly "rm -rf
>> /path_to_cassandra/data/*".
>>
>> If you don't remove the system KS, the node will behave as if it is already
>> part of the ring and so won't stream anything; it won't bootstrap, it will
>> just start. So that would be the difference imho.
>>
>> If you just wipe the system keyspace (not your data), it will work, yet
>> you will end up streaming the same data again and will need to compact it,
>> adding useless work.
>>
>> So I would go with a clean state and start the process again.
>>
> Sorry, I am not so clear about the above description; you mean:
>
> Under "192.21.0.185 299.22 GB", can I directly "rm -rf
> /path_to_cassandra/data/" without changing anything else, and start
> Cassandra again?
>
> Under "192.21.0.184 670.14MB",  I would do something  as you said "So I
> would go clean stat and start the process again.",  what commands I should
> use to do it?
>
> Thank you very much!
>
>
> Best regards
> Dillon
>
>
>>
>> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeVnodesUsing_c.html
>>
>> I would advise you to read the documentation on the DataStax website; it
>> will save you a lot of time and trouble imho. Even if I am glad to help.
>>
>> C*heers,
>>
>> -----------------
>> Alain Rodriguez
>> France
>>
>> The Last Pickle
>> http://www.thelastpickle.com
>>
>> 2016-01-27 14:11 GMT+01:00 土卜皿 <pe...@gmail.com>:
>>
>>> Hi
>>>
>>> Cassandra version: 2.1.11
>>>
>>> The existing cluster has three nodes:
>>>
>>> [root@report-02 cassandra]# bin/nodetool status
>>> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
>>> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
>>> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
>>>
>>> I wanted to add two nodes and set the same configs as the cluster's
>>> nodes.
>>>
>>> node1: 192.21.0.184
>>> node2: 192.21.0.185
>>>
>>> After starting the two nodes one by one, the first node 192.21.0.184 finished
>>> joining immediately, but the second one, 192.21.0.185, has taken more than
>>> 24 hours to join and has still not finished:
>>>
>>> Under 192.21.0.184:
>>>
>>> [root@report-01 cassandra]# bin/nodetool compactionstats
>>> pending tasks: 0
>>>
>>> Under 192.21.0.185:
>>>
>>>  [root@report-02 cassandra]# bin/nodetool compactionstats
>>>  pending tasks: 21
>>>  compaction type      keyspace       table     completed          total    unit   progress
>>>  Compaction   testforuser   users1027    6204396079    14923537640   bytes     41.57%
>>>  Compaction   user_center       users   19325435997   514143044706   bytes      3.76%
>>>  Compaction   user_center       users   12305639479   118703090319   bytes     10.37%
>>>  Active compaction remaining time :  10h05m54s
>>>
>>> And:
>>>
>>> [root@report-02 cassandra]# bin/nodetool status
>>> Datacenter: DC1
>>> ===============
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address       Load       Tokens  Owns    Host ID                               Rack
>>> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
>>> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
>>> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
>>> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
>>> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>>>
>>> From the above load sizes, node2 (192.21.0.185)'s 299.22 GB is obviously
>>> not normal.
>>>
>>> And node2's bootstrap was interrupted several times because it got an
>>> error:
>>>
>>> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.135 is complete
>>> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.131 is complete
>>> WARN  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Stream failed
>>> ERROR 00:57:42 Exception encountered during startup
>>> java.lang.RuntimeException: Error during boostrap: Stream failed
>>>
>>> So I restarted it and the join continued!
>>>
>>> I don't know why there is such a difference between the two nodes.
>>>
>>> Should I stop it and change something?
>>>
>>> Thank you in advance!
>>>
>>> Dillon
>>>
>>
>>

Re: why is one of the newly added nodes' bootstrap so slow?

Posted by 土卜皿 <pe...@gmail.com>.
Hi Alain,
Thank you very much!


>> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
>> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>>
>>
> Obviously .184 didn't bootstrap correctly. When a node is added, it
> becomes responsible for a range (multiple ranges with vnodes), so it has to
> receive data from nodes previously responsible for this (these) range(s).
> So 600 MB looks wrong.
>

What should I do about this wrong bootstrap?


>
> So .185 is behaving as expected, .184 isn't.
>
> Yet .185 having twice the data of the other nodes is weird, unless you
> changed the Replication factor or streamed data multiple times (then
> compaction will eventually fix this).
>

No, I did not change the Replication factor.


> Plus this node has fewer tokens than the first 3 nodes.
> Are you running heterogeneous hardware?
>

Yes, the old nodes with Memory: 64G, Disk: 4 X 1.1T and CPU: 16 cores, the
old nodes with: Memory 32G, Disk: 1 X 460G and CPU: 32 cores



> Why set 512 tokens for the first 3 nodes, and 256 for the others? From
> what I heard the default vnode count is way too high; you generally want to
> go with something between 16 and 64 in production (if it is not too late).
>

No particular reason :-), the 512 is from someone's example, and 256 because
I used different hardware. Can I modify all the numbers after I add these new
nodes successfully?


>> So I restarted it and the join continued! I don't know why there is such a
>> difference between the two nodes?
>>
> My guess is the join did not continue. Once you bootstrap a node, the
> system keyspace is filled with some information. If the bootstrap fails,
> you need to wipe the data directory. I advise you to directly "rm -rf
> /path_to_cassandra/data/*".
>
> If you don't remove the system KS, the node will behave as if it is already
> part of the ring and so won't stream anything; it won't bootstrap, it will
> just start. So that would be the difference imho.
>
> If you just wipe the system keyspace (not your data), it will work, yet
> you will end up streaming the same data again and will need to compact it,
> adding useless work.
>
> So I would go with a clean state and start the process again.
>
Sorry, I am not so clear about the above description; you mean:

Under "192.21.0.185 299.22 GB", can I directly "rm -rf
/path_to_cassandra/data/" without changing anything else, and start
Cassandra again?

Under "192.21.0.184 670.14MB",  I would do something  as you said "So I
would go clean stat and start the process again.",  what commands I should
use to do it?

Thank you very much!


Best regards
Dillon


>
> https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeVnodesUsing_c.html
>
> I would advise you to read the documentation on the DataStax website; it
> will save you a lot of time and trouble imho. Even if I am glad to help.
>
> C*heers,
>
> -----------------
> Alain Rodriguez
> France
>
> The Last Pickle
> http://www.thelastpickle.com
>
> 2016-01-27 14:11 GMT+01:00 土卜皿 <pe...@gmail.com>:
>
>> Hi
>>
>> Cassandra version: 2.1.11
>>
>> The existing cluster has three nodes:
>>
>> [root@report-02 cassandra]# bin/nodetool status
>> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
>> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
>> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
>>
>> I wanted to add two nodes and set the same configs as the cluster's nodes.
>>
>> node1: 192.21.0.184
>> node2: 192.21.0.185
>>
>> After starting the two nodes one by one, the first node 192.21.0.184 finished
>> joining immediately, but the second one, 192.21.0.185, has taken more than
>> 24 hours to join and has still not finished:
>>
>> Under 192.21.0.184:
>>
>> [root@report-01 cassandra]# bin/nodetool compactionstats
>> pending tasks: 0
>>
>> Under 192.21.0.185:
>>
>>  [root@report-02 cassandra]# bin/nodetool compactionstats
>>  pending tasks: 21
>>  compaction type      keyspace       table     completed          total    unit   progress
>>  Compaction   testforuser   users1027    6204396079    14923537640   bytes     41.57%
>>  Compaction   user_center       users   19325435997   514143044706   bytes      3.76%
>>  Compaction   user_center       users   12305639479   118703090319   bytes     10.37%
>>  Active compaction remaining time :  10h05m54s
>>
>> And:
>>
>> [root@report-02 cassandra]# bin/nodetool status
>> Datacenter: DC1
>> ===============
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address       Load       Tokens  Owns    Host ID                               Rack
>> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
>> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
>> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
>> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
>> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>>
>> From the above load sizes, node2 (192.21.0.185)'s 299.22 GB is obviously
>> not normal.
>>
>> And node2's bootstrap was interrupted several times because it got an error:
>>
>> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.135 is complete
>> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.131 is complete
>> WARN  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Stream failed
>> ERROR 00:57:42 Exception encountered during startup
>> java.lang.RuntimeException: Error during boostrap: Stream failed
>>
>> So I restarted it and the join continued!
>>
>> I don't know why there is such a difference between the two nodes.
>>
>> Should I stop it and change something?
>>
>> Thank you in advance!
>>
>> Dillon
>>
>
>

Re: why is one of the newly added nodes' bootstrap so slow?

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Dillon,

2 emails again for the same issue, just saying :-).

I'll add something I forgot when answering the other email.

> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>
Obviously .184 didn't bootstrap correctly. When a node is added, it becomes
responsible for a range (multiple ranges with vnodes), so it has to receive
data from nodes previously responsible for this (these) range(s). So 600 MB
looks wrong.
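
You can check the tokens a node actually took with, for example:

bin/nodetool info --tokens   # all tokens owned by the local node
bin/nodetool ring            # the whole token ring, node by node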

So .185 is behaving as expected, .184 isn't.

Yet .185 having twice the data of the other nodes is weird, unless you
changed the Replication factor or streamed data multiple times (then
compaction will eventually fix this). Plus this node has fewer tokens than
the first 3 nodes. Are you running heterogeneous hardware? Why set 512
tokens for the first 3 nodes, and 256 for the others? From what I heard the
default vnode count is way too high; you generally want to go with something
between 16 and 64 in production (if it is not too late).
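
If you do rebuild these nodes, that is set in cassandra.yaml before the very
first start (num_tokens cannot be changed once a node has joined); for
instance, picking an illustrative value in that range:

num_tokens: 32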

https://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeVnodesUsing_c.html

I would advise you to read the documentation on the DataStax website; it
will save you a lot of time and trouble imho. Even if I am glad to help.

C*heers,

-----------------
Alain Rodriguez
France

The Last Pickle
http://www.thelastpickle.com

2016-01-27 14:11 GMT+01:00 土卜皿 <pe...@gmail.com>:

> Hi
>
> Cassandra version: 2.1.11
>
> The existing cluster has three nodes:
>
> [root@report-02 cassandra]# bin/nodetool status
> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
>
> I wanted to add two nodes and set the same configs as the cluster's nodes.
>
> node1: 192.21.0.184
> node2: 192.21.0.185
>
> After starting the two nodes one by one, the first node 192.21.0.184 finished
> joining immediately, but the second one, 192.21.0.185, has taken more than
> 24 hours to join and has still not finished:
>
> Under 192.21.0.184:
>
> [root@report-01 cassandra]# bin/nodetool compactionstats
> pending tasks: 0
>
> Under 192.21.0.185:
>
>  [root@report-02 cassandra]# bin/nodetool compactionstats
>  pending tasks: 21
>  compaction type      keyspace       table     completed          total    unit   progress
>  Compaction   testforuser   users1027    6204396079    14923537640   bytes     41.57%
>  Compaction   user_center       users   19325435997   514143044706   bytes      3.76%
>  Compaction   user_center       users   12305639479   118703090319   bytes     10.37%
>  Active compaction remaining time :  10h05m54s
>
> And:
>
> [root@report-02 cassandra]# bin/nodetool status
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address       Load       Tokens  Owns    Host ID                               Rack
> UN  192.21.0.135  120.85 GB  512     ?       11e1e80f-9c5f-4f7c-81f2-42d3b704d8e3  RAC1
> UN  192.21.0.133  129.13 GB  512     ?       3e662ccb-fa2b-427b-9ca1-c2d3468bfbc9  RAC1
> UN  192.21.0.131  149.05 GB  512     ?       60f763f3-09bc-4d6f-9301-494c93857fc1  RAC1
> UJ  192.21.0.185  299.22 GB  256     ?       84c0dd16-6491-4bfb-b288-d4e410cd8c2a  RAC1
> UN  192.21.0.184  670.14 MB  256     ?       4041c232-c110-4315-89a1-23ca53b851c2  RAC1
>
> From the above load sizes, node2 (192.21.0.185)'s 299.22 GB is obviously
> not normal.
>
> And node2's bootstrap was interrupted several times because it got an error:
>
> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.135 is complete
> INFO  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Session with /192.21.0.131 is complete
> WARN  00:57:42 [Stream #8eb8cbe0-c488-11e5-baf9-918c8558de90] Stream failed
> ERROR 00:57:42 Exception encountered during startup
> java.lang.RuntimeException: Error during boostrap: Stream failed
>
> So I restarted it and the join continued!
>
> I don't know why there is such a difference between the two nodes.
>
> Should I stop it and change something?
>
> Thank you in advance!
>
> Dillon
>