Posted to user@cassandra.apache.org by Marc Hoppins <ma...@eset.com> on 2022/06/03 11:38:50 UTC

Cluster & Nodetool

Hi all,

Am new to Cassandra.  Just finished installing on 22 nodes across 2 datacentres.

If I run nodetool describecluster, I get:

Stats for all nodes:
        Live: 22
        Joining: 0
        Moving: 0
        Leaving: 0
        Unreachable: 0

Data Centers:
        BA #Nodes: 9 #Down: 0
        DR1 #Nodes: 8 #Down: 0

There should be 12 in BA and 10 in DR1.  The service is running on the other nodes, yet nodetool status also shows only the numbers above.

Datacenter: BA
==============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.197  304.72 KiB  16      11.4%             26d5a89c-aa8f-4249-b2b5-82341cc214bc  SSW09
UN  10.1.146.186  245.02 KiB  16      9.0%              29f20519-51f9-493c-b891-930762d82231  SSW09
UN  10.1.146.20   129.53 KiB  16      12.5%             f90dd318-1357-46ca-9870-807d988658b3  SSW09
UN  10.1.146.200  150.31 KiB  16      11.1%             c544e85a-c2c5-4afd-aca8-1854a1723c2f  SSW09
UN  10.1.146.17   185.9 KiB   16      11.7%             db9d9856-3082-44a8-b292-156da1a17d0a  SSW09
UN  10.1.146.174  288.64 KiB  16      12.1%             03126eba-8b58-4a96-80ca-10cec2e18e69  SSW09
UN  10.1.146.199  146.71 KiB  16      13.7%             860d6549-94ab-4a07-b665-70ea7e53f41a  SSW09
UN  10.1.146.78   69.05 KiB   16      11.5%             7d9fdbab-40b0-4a9e-b0c9-4ffa822c42fd  SSW09
UN  10.1.146.67   304.5 KiB   16      13.6%             48e9eba2-9112-4d91-8f26-8272cb5ce7bc  SSW09

Datacenter: DR1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.1.146.137  209.33 KiB  16      12.6%             f65c685f-048c-41de-85e4-308c4b84d047  SSW02
UN  10.1.146.141  237.21 KiB  16      9.8%              847ad921-fceb-4cef-acec-1c918d2a6517  SSW02
UN  10.1.146.131  311.05 KiB  16      11.7%             7263f6c6-c4d6-438e-8ee7-d07666242ba0  SSW02
UN  10.1.146.139  283.33 KiB  16      11.5%             264cbe47-acb4-49cc-97d0-6f9e2cee6844  SSW02
UN  10.1.146.140  258.46 KiB  16      11.6%             43dbbe91-5dac-4c3a-9df5-2f5ccf268eb6  SSW02
UN  10.1.146.132  157.03 KiB  16      12.3%             1c0cb23c-af78-4fa2-bd92-20fa7d39ec30  SSW02
UN  10.1.146.135  301.13 KiB  16      11.2%             26159fbe-cf78-4c94-88e0-54773bcf7bed  SSW02
UN  10.1.146.130  305.16 KiB  16      12.5%             d6d6c490-551d-4a97-a93c-3b772b750d7d  SSW02

So I restarted the service on one of the missing addresses. It appeared in the list, but another node dropped off.  I tried this several times; it seems I can only ever get 9 and 8, not 12 and 10.

Anyone have an idea why this may be so?

Thanks

Marc

Re: Cluster & Nodetool

Posted by Bowen Song <bo...@bso.ng>.
Was more than one node added to the cluster at the same time? I.e. did 
you start a new node joining the cluster without waiting for the 
previous node to finish joining the same cluster? This can happen if 
you don't have "serial: 1" in your Ansible playbook, or don't have a 
proper wait between starts.
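
In shell terms, the kind of wait I mean is roughly the following sketch 
(the host addresses, service name and ssh access below are placeholders 
and assumptions, not taken from your setup):

    # start the nodes strictly one at a time
    for host in 10.0.0.1 10.0.0.2 10.0.0.3; do           # placeholder addresses
        ssh "$host" 'sudo systemctl start cassandra'     # or however you start the service
        # wait until nodetool answers on the new node...
        until ssh "$host" 'nodetool status' >/dev/null 2>&1; do sleep 10; done
        # ...then wait until nothing is still in the Joining (UJ) state
        while ssh "$host" 'nodetool status' | grep -q '^UJ'; do sleep 10; done
    done

With Ansible, "serial: 1" plus a task that performs an equivalent check 
before moving on to the next host achieves the same thing.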

Removing the data directory will remove the node ID, and then the node 
will join the cluster as a brand new node, which will solve the 
duplicate node ID issue.
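
One way to confirm you have hit this is to collect the Host ID each node 
reports about itself and look for duplicates, e.g. (host list and ssh 
access are placeholders; this relies on the "ID" line of `nodetool info`):

    for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
        ssh "$host" 'nodetool info' | awk '/^ID/ {print $NF}'
    done | sort | uniq -d    # any ID printed here is claimed by more than one node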

Regarding the seed nodes, you should not make all nodes seed nodes, with 
the only exception of very small clusters (<= 3 nodes). Assuming you have 
more than 1 DC, using 1 or 2 seed nodes per DC is fairly reasonable. Even 
only 1 per DC is still reliable in a multi-DC setup, as the following 3 
things must all happen at the same time to make it fail: 1. network 
partitioning affecting the DC; 2. seed node failure in the same DC; and 
3. starting a node in the same DC. Even so, the problem will go away 
automatically once the network connectivity between DCs is restored or 
the seed node comes back. If you have multiple racks per DC, you can 
also consider 1 seed node per rack as an alternative.
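
For reference, the seed list is the same cassandra.yaml setting on every 
node, and it should be identical across the cluster. With 2 seeds per DC 
it would look roughly like this (the addresses are picked arbitrarily 
from your listing, purely as an illustration):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.146.197,10.1.146.20,10.1.146.137,10.1.146.131"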

Determining the DC & rack depends on your server provider, and tokens 
per node depends on the hardware differences between nodes (is it just 
the disk, or RAM and CPU too?). There's no one-size-fits-all solution; 
use your own judgement.
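
For what it's worth, if you use GossipingPropertyFileSnitch, the DC and 
rack are just two lines in cassandra-rackdc.properties on each node, and 
the token count is num_tokens in cassandra.yaml, so both are easy to 
template per host group. The values below are only examples:

    # cassandra-rackdc.properties (per node)
    dc=BA
    rack=SSW09

    # cassandra.yaml (per node; heavier hardware can get proportionally more tokens)
    num_tokens: 16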


RE: Cluster & Nodetool

Posted by Marc Hoppins <ma...@eset.com>.
No. It transpires that, after seeing errors when running a start.yml for Ansible, I decided to start all the nodes again, and when they started, some assumed the same host ID as others.

I resolved this by shutting down the service on the affected nodes, removing the data dirs (these are all new nodes: no data), and restarting the service one by one, making sure each new node appeared before starting another.
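
For the record, on each affected node that amounted to roughly the following (default package paths assumed; adjust to your actual data_file_directories):

    sudo systemctl stop cassandra
    # only safe because these nodes hold no data yet
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*
    sudo systemctl start cassandra
    nodetool status    # wait for the node to show up as UN before touching the next one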

All are now alive and kicking (copyright Simple Minds).

Vis-à-vis seeds: given my setup only has a small number of nodes, I used 1 node out of 4 as a seed. I have seen folk suggest every node (sounds excessive), 1 per datacentre (seems unreliable), and also 3 seeds per datacentre, which could be adequate if they are not all in the same rack (which mine currently are).  What are the suggestions/best practice?  2 per switch/rack for failover, or just a set number per datacentre?

For automated install: how do you go about resolving dc & rack, and tokens per node (if the hardware varies)?

Marc


Re: Cluster & Nodetool

Posted by Bowen Song <bo...@bso.ng>.
That sounds like something caused by duplicated node IDs (the Host ID 
column in `nodetool status`). Did you by any chance copy the Cassandra 
data directory between nodes? (e.g. spinning up a new node from a VM 
snapshot that contains a non-empty data directory)
