Posted to user@cassandra.apache.org by Marc Hoppins <ma...@eset.com> on 2022/06/03 11:38:50 UTC
Cluster & Nodetool
Hi all,
Am new to Cassandra. Just finished installing on 22 nodes across 2 datacentres.
If I run nodetool describecluster I get
Stats for all nodes:
Live: 22
Joining: 0
Moving: 0
Leaving: 0
Unreachable: 0
Data Centers:
BA #Nodes: 9 #Down: 0
DR1 #Nodes: 8 #Down: 0
There should be 12 in BA and 10 in DR1. The service is running on the other nodes, yet nodetool status also shows only the numbers above.
Datacenter: BA
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.1.146.197 304.72 KiB 16 11.4% 26d5a89c-aa8f-4249-b2b5-82341cc214bc SSW09
UN 10.1.146.186 245.02 KiB 16 9.0% 29f20519-51f9-493c-b891-930762d82231 SSW09
UN 10.1.146.20 129.53 KiB 16 12.5% f90dd318-1357-46ca-9870-807d988658b3 SSW09
UN 10.1.146.200 150.31 KiB 16 11.1% c544e85a-c2c5-4afd-aca8-1854a1723c2f SSW09
UN 10.1.146.17 185.9 KiB 16 11.7% db9d9856-3082-44a8-b292-156da1a17d0a SSW09
UN 10.1.146.174 288.64 KiB 16 12.1% 03126eba-8b58-4a96-80ca-10cec2e18e69 SSW09
UN 10.1.146.199 146.71 KiB 16 13.7% 860d6549-94ab-4a07-b665-70ea7e53f41a SSW09
UN 10.1.146.78 69.05 KiB 16 11.5% 7d9fdbab-40b0-4a9e-b0c9-4ffa822c42fd SSW09
UN 10.1.146.67 304.5 KiB 16 13.6% 48e9eba2-9112-4d91-8f26-8272cb5ce7bc SSW09
Datacenter: DR1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.1.146.137 209.33 KiB 16 12.6% f65c685f-048c-41de-85e4-308c4b84d047 SSW02
UN 10.1.146.141 237.21 KiB 16 9.8% 847ad921-fceb-4cef-acec-1c918d2a6517 SSW02
UN 10.1.146.131 311.05 KiB 16 11.7% 7263f6c6-c4d6-438e-8ee7-d07666242ba0 SSW02
UN 10.1.146.139 283.33 KiB 16 11.5% 264cbe47-acb4-49cc-97d0-6f9e2cee6844 SSW02
UN 10.1.146.140 258.46 KiB 16 11.6% 43dbbe91-5dac-4c3a-9df5-2f5ccf268eb6 SSW02
UN 10.1.146.132 157.03 KiB 16 12.3% 1c0cb23c-af78-4fa2-bd92-20fa7d39ec30 SSW02
UN 10.1.146.135 301.13 KiB 16 11.2% 26159fbe-cf78-4c94-88e0-54773bcf7bed SSW02
UN 10.1.146.130 305.16 KiB 16 12.5% d6d6c490-551d-4a97-a93c-3b772b750d7d SSW02
So I restarted the service on one of the missing addresses. It appeared in the list, but another node dropped off. I tried this several times; it seems I can only get 9 and 8, not 12 and 10.
Anyone have an idea why this may be so?
Thanks
Marc
Re: Cluster & Nodetool
Posted by Bowen Song <bo...@bso.ng>.
Was more than one node added to the cluster at the same time? That is, did
you start a new node joining the cluster without waiting for the previous
node to finish joining? This can happen if you don't have "serial: 1" in
your Ansible playbook, or don't have a proper wait.
Removing the data directory removes the node ID, so the node will join the
cluster as a brand-new node, which solves the duplicate node ID issue.
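The "proper wait" between node starts can be sketched in a few lines. The helper below is illustrative only (Python; the function name is hypothetical and the UN-state column layout is assumed from the `nodetool status` output quoted in this thread):

```python
import re

def count_un_nodes(status_output: str) -> int:
    """Count nodes reported as UN (Up/Normal) in `nodetool status` output."""
    return sum(1 for line in status_output.splitlines()
               if re.match(r"UN\s", line.strip()))

# A deployment script could poll this between node starts, e.g. (illustrative):
# import subprocess, time
# while count_un_nodes(subprocess.run(["nodetool", "status"],
#         capture_output=True, text=True).stdout) < expected_count:
#     time.sleep(10)
```

Only proceeding to the next node once the count reaches the expected value avoids two nodes bootstrapping at once.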
Regarding the seed nodes: you should not make all nodes seed nodes, with
the only exception being a very small cluster (<= 3 nodes). Assuming you
have more than 1 DC, using 1 or 2 seed nodes per DC is fairly reasonable.
Even only 1 per DC is still reliable in a multi-DC setup, as the following
3 things must all happen at the same time for it to fail: 1. network
partitioning affecting the DC; 2. seed node failure in the same DC; and
3. starting a node in the same DC. Even then, the problem goes away
automatically once network connectivity between the DCs is restored or
the seed node comes back. If you have multiple racks per DC, you can
also consider 1 seed node per rack as an alternative.
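As a sketch of the seed-count guidance above, here is a hypothetical helper that derives a seed list from an inventory map. The function name, inventory structure, and IPs are illustrative, not from any real config:

```python
def pick_seeds(inventory, per_dc=2, per_rack=False):
    """inventory: {dc: {rack: [ip, ...]}} -> flat list of seed IPs.

    Default: first `per_dc` nodes of each DC. With per_rack=True,
    the first node of every rack instead (one seed per rack).
    """
    seeds = []
    for dc in sorted(inventory):
        racks = inventory[dc]
        if per_rack:
            seeds.extend(sorted(racks[rack])[0] for rack in sorted(racks))
        else:
            nodes = [ip for rack in sorted(racks) for ip in sorted(racks[rack])]
            seeds.extend(nodes[:per_dc])
    return seeds
```

The resulting list could then be templated into each node's seed_provider setting by whatever configuration tool you use.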
Determining the DC & rack depends on your server provider, and tokens
per node depends on the hardware differences between nodes (is it just
the disk? or RAM and CPU too?). There's no one-size-fits-all solution;
use your own judgement.
On 06/06/2022 09:54, Marc Hoppins wrote:
RE: Cluster & Nodetool
Posted by Marc Hoppins <ma...@eset.com>.
No. It transpires that, after seeing errors when running a start.yml for Ansible, I decided to start all nodes again, and on starting, some assumed the same ID as others.
I resolved this by shutting down the service on the affected nodes, removing the data dirs (these are all new nodes: no data), and restarting the service one by one, making sure each new node appeared before starting another.
All are now alive and kicking (copyright Simple Minds).
Viz. seeds: given my setup only has a small number of nodes, I used 1 node out of 4 as a seed. I have seen folk suggest every node (sounds excessive), 1 per datacentre (seems unreliable), and also 3 seeds per datacentre... which could be adequate if they are not all in the same rack (which mine currently are). What is the suggested best practice? 2 per switch/rack for failover, or just a set number per datacentre?
For an automated install: how do you go about resolving DC & rack, and tokens per node (if the hardware varies)?
Marc
-----Original Message-----
From: Bowen Song <bo...@bso.ng>
Sent: Saturday, June 4, 2022 3:10 PM
To: user@cassandra.apache.org
Subject: Re: Cluster & Nodetool
Re: Cluster & Nodetool
Posted by Bowen Song <bo...@bso.ng>.
That sounds like something caused by duplicated node IDs (the Host ID
column in `nodetool status`). Did you by any chance copy the Cassandra
data directory between nodes? (e.g. spinning up a new node from a VM
snapshot that contains a non-empty data directory)
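A quick way to check for duplicated Host IDs is to scan the `nodetool status` output for repeated UUIDs. The helper below is a hypothetical sketch (Python; the UUID pattern matches the Host ID column shown in this thread):

```python
import re
from collections import Counter

# Standard lowercase 8-4-4-4-12 hex UUID, as printed in the Host ID column.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")

def duplicate_host_ids(status_output: str) -> dict:
    """Return {host_id: count} for Host IDs seen on more than one line."""
    counts = Counter()
    for line in status_output.splitlines():
        m = UUID_RE.search(line)
        if m:
            counts[m.group(0)] += 1
    return {hid: n for hid, n in counts.items() if n > 1}
```

An empty result means every listed node has a distinct Host ID; note this only inspects the nodes a single node currently sees, so a missing node with a stolen ID would simply not appear.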
On 03/06/2022 12:38, Marc Hoppins wrote: