Posted to user@cassandra.apache.org by Joe Obernberger <jo...@gmail.com> on 2022/01/07 20:51:02 UTC

removing a drive - 4.0.1

Hi All - I have a 13 node cluster running Cassandra 4.0.1.  If I stop a 
node, edit the cassandra.yaml file, comment out the first drive in the 
list, and restart the node, it fails to start saying that a node already 
exists in the cluster with the IP address.

If I put the drive back into the list, the node still fails to start 
with the same error.  At this point the node is useless, and I think the 
only option is to remove all the data and re-bootstrap it?
---------

ERROR [main] 2022-01-07 15:50:09,155 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: A node with address /172.16.100.39:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
         at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
         at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)

-----------

If I remove a drive other than the first one, this problem doesn't 
occur.  Any other options?  It appears that if the first drive in the 
list goes bad, or is just removed, the entire node must be replaced.

-Joe

Re: removing a drive - 4.0.1

Posted by Joe Obernberger <jo...@gmail.com>.

Thank you Dmitry.
At this point the one node where I removed the first drive from the list 
and then rebuilt is in an odd state.  Locally, nodetool status shows it 
as up (UN), but all the other nodes in the cluster show it as down (DN).
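
One way to make the UN/DN disagreement concrete is to pull the state column for the same address out of `nodetool status` on several nodes. The helper below is only a sketch: `node_state` is a made-up name, and it assumes the usual layout in which the two-letter state is the first column and the address is the second.

```shell
#!/bin/sh
# node_state IP (hypothetical helper): read `nodetool status` output on stdin
# and print the two-letter state (e.g. UN, DN) of the row whose address is IP.
# Assumes the state is column 1 and the address is column 2.
node_state() {
    awk -v ip="$1" '$2 == ip { print $1 }'
}

# Example use, run from a machine that can reach each node's JMX port:
#   nodetool -h <some-node> status | node_state 172.16.100.39
```

Running this against the rebuilt node and against one or two of its peers shows the local view next to the cluster's view, which is useful evidence to attach to a follow-up question.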

Not sure what to do at this juncture.

-Joe

On 1/7/2022 4:38 PM, Dmitry Saprykin wrote:
> There is a jira ticket describing your situation
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-14793
>
> I may be wrong, but it seems that system directories are pinned to the 
> first data directory in cassandra.yaml by default. When you removed the 
> first item from the list, system data was regenerated in the new first 
> directory in the list, and then merged(?) when the original first dir returned.

Re: removing a drive - 4.0.1

Posted by Joe Obernberger <jo...@gmail.com>.

When a drive fails in a large cluster and you don't immediately have a 
replacement drive, is it OK to just remove the drive from cassandra.yaml 
and restart the node?  Will the missing data (assuming RF=3) be 
re-replicated?
I have disk_failure_policy set to "best_effort", but the node still 
fails (i.e. Cassandra exits) when a disk (spinning rust) goes bad.
I do have commit_failure_policy set to "stop".

Thank you!

-Joe

On 1/7/2022 4:38 PM, Dmitry Saprykin wrote:
> There is a jira ticket describing your situation
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-14793
>
> I may be wrong, but it seems that system directories are pinned to the 
> first data directory in cassandra.yaml by default. When you removed the 
> first item from the list, system data was regenerated in the new first 
> directory in the list, and then merged(?) when the original first dir returned.

Re: removing a drive - 4.0.1

Posted by Dmitry Saprykin <sa...@gmail.com>.

There is a jira ticket describing your situation
https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-14793

I may be wrong, but it seems that system directories are pinned to the first
data directory in cassandra.yaml by default. When you removed the first item
from the list, system data was regenerated in the new first directory in the
list, and then merged(?) when the original first dir returned.
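
A quick way to check this guess on a stopped node is to see which of the configured data directories contain a `system` keyspace folder. This is only a sketch: `system_dirs` is a made-up helper name, and the paths you pass in are whatever your own `data_file_directories` lists.

```shell
#!/bin/sh
# system_dirs DIR... (hypothetical helper): print every data directory that
# contains a "system" keyspace subdirectory. Nothing is modified.
system_dirs() {
    for d in "$@"; do
        if [ -d "$d/system" ]; then
            printf '%s\n' "$d"
        fi
    done
}

# Example with the layout from this thread:
#   system_dirs /data/1/cassandra /data/2/cassandra /data/3/cassandra
```

If more than one directory is printed after the first entry was removed and then restored, that would be consistent with system data having been regenerated in the new first directory. Cassandra 4.0 also appears to offer a `local_system_data_file_directory` setting for keeping these tables somewhere else; treat that as an assumption to confirm against the cassandra.yaml shipped with your version.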

On Fri, Jan 7, 2022 at 4:23 PM Joe Obernberger <jo...@gmail.com>
wrote:

> Hi - in order to get the node back up and running I did the following:
> deleted all data on the node, added -Dcassandra.replace_address=172.16.100.39
> to the cassandra-env.sh file, and started it up.  It is currently
> bootstrapping.
>
> In cassandra.yaml, say you have the following:
>
> data_file_directories:
>     - /data/1/cassandra
>     - /data/2/cassandra
>     - /data/3/cassandra
>     - /data/4/cassandra
>     - /data/5/cassandra
>     - /data/6/cassandra
>     - /data/7/cassandra
>     - /data/8/cassandra
>
> If I change the above to:
> #    - /data/1/cassandra
>     - /data/2/cassandra
>     - /data/3/cassandra
>     - /data/4/cassandra
>     - /data/5/cassandra
>     - /data/6/cassandra
>     - /data/7/cassandra
>     - /data/8/cassandra
>
> the problem happens.  If I change it to:
>
>     - /data/1/cassandra
>     - /data/2/cassandra
>     - /data/3/cassandra
>     - /data/4/cassandra
>     - /data/5/cassandra
>     - /data/6/cassandra
>     - /data/7/cassandra
> #    - /data/8/cassandra
>
> the node starts up OK.  I assume it will recover the missing data during a
> repair?
>
> -Joe

Re: removing a drive - 4.0.1

Posted by Joe Obernberger <jo...@gmail.com>.

Hi - in order to get the node back up and running I did the following:
deleted all data on the node, added -Dcassandra.replace_address=172.16.100.39
to the cassandra-env.sh file, and started it up.  It is currently 
bootstrapping.
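
For anyone repeating this, the flag is normally appended to JVM_OPTS. The fragment below is a sketch that assumes the file is conf/cassandra-env.sh as in a stock install (presumably the file meant here); the REPLACE_ADDRESS variable is introduced only for readability.

```shell
# Fragment for conf/cassandra-env.sh (assumed location; adjust to your install).
# The address is the one reported in the startup error in this thread.
REPLACE_ADDRESS="172.16.100.39"
JVM_OPTS="${JVM_OPTS:-} -Dcassandra.replace_address=${REPLACE_ADDRESS}"
```

The line should come out again once the node has finished bootstrapping, so that later restarts do not try to replace the node a second time. There is also a `-Dcassandra.replace_address_first_boot` variant intended to be ignored after the first successful start; check the documentation for your version before relying on that behaviour.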

In cassandra.yaml, say you have the following:

data_file_directories:
     - /data/1/cassandra
     - /data/2/cassandra
     - /data/3/cassandra
     - /data/4/cassandra
     - /data/5/cassandra
     - /data/6/cassandra
     - /data/7/cassandra
     - /data/8/cassandra

If I change the above to:
#    - /data/1/cassandra
     - /data/2/cassandra
     - /data/3/cassandra
     - /data/4/cassandra
     - /data/5/cassandra
     - /data/6/cassandra
     - /data/7/cassandra
     - /data/8/cassandra

the problem happens.  If I change it to:

     - /data/1/cassandra
     - /data/2/cassandra
     - /data/3/cassandra
     - /data/4/cassandra
     - /data/5/cassandra
     - /data/6/cassandra
     - /data/7/cassandra
#    - /data/8/cassandra

the node starts up OK.  I assume it will recover the missing data during 
a repair?

-Joe

On 1/7/2022 4:13 PM, Mano ksio wrote:
> Hi, you may have already tried, but this may help. 
> https://stackoverflow.com/questions/29323709/unable-to-start-cassandra-node-already-exists 
>
>
> Can you elaborate a little on 'If I remove a drive other than the first 
> one'? What does that mean?

Re: removing a drive - 4.0.1

Posted by Mano ksio <ma...@gmail.com>.

Hi, you may have already tried, but this may help.
https://stackoverflow.com/questions/29323709/unable-to-start-cassandra-node-already-exists

Can you elaborate a little on 'If I remove a drive other than the first one'?
What does that mean?

On Fri, Jan 7, 2022 at 2:52 PM Joe Obernberger <jo...@gmail.com>
wrote:

> Hi All - I have a 13 node cluster running Cassandra 4.0.1.  If I stop a
> node, edit the cassandra.yaml file, comment out the first drive in the
> list, and restart the node, it fails to start saying that a node already
> exists in the cluster with the IP address.
>
> If I put the drive back into the list, the node still fails to start
> with the same error.  At this point the node is useless, and I think the
> only option is to remove all the data and re-bootstrap it?
> ---------
>
> ERROR [main] 2022-01-07 15:50:09,155 CassandraDaemon.java:909 - Exception encountered during startup
> java.lang.RuntimeException: A node with address /172.16.100.39:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
>          at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:659)
>          at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
>          at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
>          at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
>          at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
>          at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
>          at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
>
> -----------
>
> If I remove a drive other than the first one, this problem doesn't
> occur.  Any other options?  It appears that if the first drive in the
> list goes bad, or is just removed, the entire node must be replaced.
>
> -Joe
>
>