Posted to user@cassandra.apache.org by Joe Obernberger <jo...@gmail.com> on 2023/01/16 15:27:45 UTC
Failed disks - correct procedure
Hi all - what is the correct procedure when handling a failed disk?
Have a node in a 15 node cluster. This node has 16 drives and cassandra
data is split across them. One drive is failing. Can I just remove it
from the list and cassandra will then replicate? If not - what?
Thank you!
-Joe
--
This email has been checked for viruses by AVG antivirus software.
www.avg.com
Re: Failed disks - correct procedure
Posted by Joe Obernberger <jo...@gmail.com>.
Some more observations. If the first drive fails on a node, then you
can't just remove it from the list. Example:
We have:
/data/1/cassandra
/data/2/cassandra
/data/3/cassandra
/data/4/cassandra
...
If /data/1 fails and I remove it from the list, then when I try to start
cassandra on that node it says a node with that address already exists
and that I need to replace it. I think the only option at that
point is to bootstrap it and use the replace_address option.
-Joe
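
The observation above can be sketched as a small decision helper. This
is only an illustration in Python, not Cassandra code: the function name
and the returned strings are hypothetical, and the "first directory is
special" rule is taken from the behaviour reported in this message, not
from the Cassandra documentation.

```python
def procedure_for_failed_directory(data_dirs, failed_dir):
    """Pick a recovery path for a failed data directory.

    Assumption (from this thread): losing the first listed directory
    means the node has to be replaced, while losing any other one can
    be handled by removing it from the list.
    """
    if failed_dir not in data_dirs:
        raise ValueError(f"{failed_dir} is not a configured data directory")
    if data_dirs.index(failed_dir) == 0:
        return "replace the node (bootstrap with the replace_address option)"
    return "remove the directory from the list, restart, run repair"


dirs = [
    "/data/1/cassandra",
    "/data/2/cassandra",
    "/data/3/cassandra",
    "/data/4/cassandra",
]
print(procedure_for_failed_directory(dirs, "/data/1/cassandra"))
print(procedure_for_failed_directory(dirs, "/data/3/cassandra"))
```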
On 1/17/2023 10:41 AM, C. Scott Andreas wrote:
> Bumping this note from Andy downthread to make sure everyone has seen
> it and is aware:
>
> “Before you do that, you will want to make sure a cycle of repairs has
> run on the replicas of the down node to ensure they are consistent
> with each other.”
>
> When replacing an instance, it’s necessary to run repair (incremental
> or full) among the surviving replicas *before* bootstrapping a
> replacement instance in. If you don’t do this, Cassandra’s quorum
> consistency guarantees won’t be met and data may appear to be lost.
> It’s not possible to use Cassandra as a consistent database without
> doing so.
>
> Given replicas A, B, C, and replacement replica A*:
> - Quorum write is witnessed by A, B
> - A fails
> - A* is bootstrapped in without repair of B, C
> - Quorum read succeeds against A*, C
> - The successful quorum read will not observe data from the previous
> successful quorum write and the data will appear to be lost.
>
> Repairing surviving replicas before bootstrapping a replacement node
> is necessary to avoid this.
>
> — Scott
>
>> On Jan 17, 2023, at 7:28 AM, Joe Obernberger
>> <jo...@gmail.com> wrote:
>>
>>
>>
>> I come from the hadoop world where we have a cluster with probably
>> over 500 drives. Drives fail all the time; or well several a year
>> anyway. We remove that single drive from HDFS, HDFS re-balances, and
>> when we get around to it, we swap in a new drive, format it, and add
>> it back to HDFS. We keep the OS drives separate from the data drives
>> and ensure that the OS volume is in a RAID mirror. It's painful when
>> OS drives fail, so mirror works. When space is low, we add another
>> node with lots of disks.
>> We are repurposing this same hardware to run a large Cassandra
>> cluster. I'd love it if Cassandra could support larger individual
>> nodes, but we've been trying to configure it with lots of disks for
>> redundancy, with the idea that we won't use an entire node's storage
>> only for Cassandra. As was mentioned a long while back, blades seem
>> to make more sense for Cassandra than single nodes with lots of disk,
>> but we've got what we've got!
>> :)
>>
>> So far, no issues with:
>> Stop node, remove drive from cassandra config, start node, run repair
>> - version 4.1.
>>
>> -Joe
>>
>> On 1/17/2023 10:11 AM, Durity, Sean R via user wrote:
>>>
>>> For physical hardware when disks fail, I do a removenode, wait for
>>> the drive to be replaced, reinstall Cassandra, and then bootstrap
>>> the node back in (and run clean-up across the DC).
>>>
>>> All of our disks are presented as one file system for data, which is
>>> not what the original question was asking.
>>>
>>> Sean R. Durity
>>>
>>> *From:*Marc Hoppins <ma...@eset.com>
>>> *Sent:* Tuesday, January 17, 2023 3:57 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* [EXTERNAL] RE: Failed disks - correct procedure
>>>
>>> Hi all,
>>> I was pondering this very situation.
>>> We have a node with a crapped-out disk (not the first time).
>>> Removenode vs repairnode: with regard to time, there is going to be
>>> little difference twixt replacing a dead node and removing then
>>> re-installing a node. There is going to be a bunch of reads/writes
>>> and verifications (or similar) which is going to take a similar
>>> amount of time...or do I read that wrong?
>>> For myself, I just go with removenode and then rejoin after the HDD
>>> has been replaced. Usually the fix exceeds the wait time and the
>>> node is then out of the system anyway.
>>> -----Original Message-----
>>> From: Joe Obernberger <jo...@gmail.com>
>>> Sent: Monday, January 16, 2023 6:31 PM
>>> To: Jeff Jirsa <jj...@gmail.com>; user@cassandra.apache.org
>>> Subject: Re: Failed disks - correct procedure
>>> I'm using 4.1.0-1.
>>> I've been doing a lot of truncates lately before the drive failed
>>> (research project). Current drives have about 100 GB of data
>>> each, although the actual amount of data in Cassandra is much less
>>> (because of truncates and snapshots). The cluster is not
>>> homogeneous; some nodes have more drives than others.
>>> nodetool status -r
>>> Datacenter: datacenter1
>>> =======================
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address                     Load       Tokens  Owns  Host ID                               Rack
>>> UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
>>> UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
>>> UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
>>> UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
>>> UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
>>> UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
>>> UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
>>> UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
>>> UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
>>> UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
>>> UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
>>> UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
>>> UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
>>> UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
>>> UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
>>> -Joe
>>> On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
>>> > Prior to CASSANDRA-6696 you’d have to treat one missing disk as a
>>> > failed machine, wipe all the data and re-stream it, as a tombstone for
>>> > a given value may be on one disk and data on another (effectively
>>> > resurrecting data)
>>> >
>>> > So the answer has to be version dependent, too - which version were you using?
>>> >
>>> >> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>>> >>
>>> >> Hi Joe,
>>> >>
>>> >> Reading it back I realized I misunderstood that part of your email,
>>> >> so you must be using data_file_directories with 16 drives? That's a
>>> >> lot of drives! I imagine this may happen from time to time given
>>> >> that disks like to fail.
>>> >>
>>> >> That's a bit of an interesting scenario that I would have to think
>>> >> about. If you brought the node up without the bad drive, repairs are
>>> >> probably going to do a ton of repair overstreaming if you aren't
>>> >> using 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200), which may
>>> >> put things into a really bad state (lots of streaming = lots of
>>> >> compactions = slower reads) and you may be seeing some inconsistency
>>> >> if repairs weren't regularly running beforehand.
>>> >>
>>> >> How much data was on the drive that failed? How much data do you
>>> >> usually have per node?
>>> >>
>>> >> Thanks,
>>> >> Andy
>>> >>
>>> >>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>>> >>> <jo...@gmail.com> wrote:
>>> >>>
>>> >>> Thank you Andy.
>>> >>> Is there a way to just remove the drive from the cluster and replace
>>> >>> it later? Ordering replacement drives isn't a fast process...
>>> >>> What I've done so far is:
>>> >>> Stop node
>>> >>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>>> >>> Restart node
>>> >>> Run repair
>>> >>>
>>> >>> Will that work? Right now, it's showing all nodes as up.
>>> >>>
>>> >>> -Joe
>>> >>>
>>> >>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>> >>>> Hi Joe,
>>> >>>>
>>> >>>> I'd recommend just doing a replacement, bringing up a new node with
>>> >>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>> >>>> described here:
>>> >>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>> >>>>
>>> >>>> Before you do that, you will want to make sure a cycle of repairs
>>> >>>> has run on the replicas of the down node to ensure they are
>>> >>>> consistent with each other.
>>> >>>>
>>> >>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the
>>> >>>> node you are replacing and that the initial_token matches the node
>>> >>>> you are replacing (If you are not using vnodes) so the node doesn't
>>> >>>> skip bootstrapping. This is the default, but felt worth mentioning.
>>> >>>>
>>> >>>> You can also remove the dead node, which should stream data to
>>> >>>> replicas that will pick up new ranges, but you also will want to do
>>> >>>> repairs ahead of time too. To be honest it's not something I've
>>> >>>> done recently, so I'm not as confident on executing that procedure.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> Andy
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>> >>>> <jo...@gmail.com> wrote:
>>> >>>>> Hi all - what is the correct procedure when handling a failed disk?
>>> >>>>> Have a node in a 15 node cluster. This node has 16 drives and
>>> >>>>> cassandra data is split across them. One drive is failing. Can I
>>> >>>>> just remove it from the list and cassandra will then replicate? If not - what?
>>> >>>>> Thank you!
>>> >>>>>
>>> >>>>> -Joe
Re: Failed disks - correct procedure
Posted by "C. Scott Andreas" <sc...@paradoxica.net>.
Bumping this note from Andy downthread to make sure everyone has seen it and
is aware:
“Before you do that, you will want to make sure a cycle of repairs has run on
the replicas of the down node to ensure they are consistent with each other.”
When replacing an instance, it’s necessary to run repair (incremental or full)
among the surviving replicas *before* bootstrapping a replacement instance in.
If you don’t do this, Cassandra’s quorum consistency guarantees won’t be met
and data may appear to be lost. It’s not possible to use Cassandra as a
consistent database without doing so.
Given replicas A, B, C, and replacement replica A*:
- Quorum write is witnessed by A, B
- A fails
- A* is bootstrapped in without repair of B, C
- Quorum read succeeds against A*, C
- The successful quorum read will not observe data from the previous
  successful quorum write and the data will appear to be lost.
Repairing surviving replicas before bootstrapping a replacement node is
necessary to avoid this.
— Scott
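
The A, B, C, A* scenario above can be walked through with a toy model.
This Python sketch is only an illustration under stated assumptions, not
Cassandra code: replicas are plain dicts, the function names are
hypothetical, "repair" is modelled as copying B's data to C, and A* is
assumed to stream its data from C when it bootstraps.

```python
def quorum_read(replicas, contacted):
    """Return the newest value seen among the contacted replicas (or None)."""
    seen = [replicas[name].get("k") for name in contacted]
    seen = [v for v in seen if v is not None]
    return max(seen, key=lambda v: v[0])[1] if seen else None


def scenario(repair_first):
    # Each replica maps key -> (timestamp, value).
    replicas = {"A": {}, "B": {}, "C": {}}

    # Quorum write is witnessed by A and B only.
    for name in ("A", "B"):
        replicas[name]["k"] = (1, "v1")

    # A fails.
    del replicas["A"]

    if repair_first:
        # Repair among the survivors copies B's data to C.
        replicas["C"]["k"] = replicas["B"]["k"]

    # A* bootstraps by streaming from a surviving replica; this sketch
    # assumes it happens to stream from C.
    replicas["A*"] = dict(replicas["C"])

    # Quorum read is served by A* and C.
    return quorum_read(replicas, ["A*", "C"])


print(scenario(repair_first=False))  # None
print(scenario(repair_first=True))   # v1
```

Without the repair step, neither A* nor C has seen the write, so the
read returns nothing even though the write reached a quorum. With the
repair step, C and then A* both hold the value.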
Re: Failed disks - correct procedure
Posted by Joe Obernberger <jo...@gmail.com>.
I come from the hadoop world where we have a cluster with probably over
500 drives. Drives fail all the time; or well several a year anyway.
We remove that single drive from HDFS, HDFS re-balances, and when we get
around to it, we swap in a new drive, format it, and add it back to
HDFS. We keep the OS drives separate from the data drives and ensure
that the OS volume is in a RAID mirror. It's painful when OS drives
fail, so mirror works. When space is low, we add another node with lots
of disks.
We are repurposing this same hardware to run a large Cassandra cluster.
I'd love it if Cassandra could support larger individual nodes, but
we've been trying to configure it with lots of disks for redundancy,
with the idea that we won't use an entire node's storage only for
Cassandra. As was mentioned a long while back, blades seem to make more
sense for Cassandra than single nodes with lots of disk, but we've got
what we've got!
:)
So far, no issues with:
Stop node, remove drive from cassandra config, start node, run repair -
version 4.1.
-Joe
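
The "remove drive from cassandra config" step amounts to deleting one
entry under data_file_directories in cassandra.yaml. Below is a minimal
Python sketch of that edit on the file's text. The function name is
hypothetical, and it assumes each directory sits on its own "- path"
line without quotes or trailing comments; a real file may differ, in
which case a proper YAML parser would be the safer choice.

```python
def remove_data_directory(yaml_text, failed_dir):
    """Return yaml_text without the list item that names failed_dir."""
    kept = []
    for line in yaml_text.splitlines():
        # Match only a list item such as "    - /data/3/cassandra".
        if line.strip() == f"- {failed_dir}":
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


example = """\
data_file_directories:
    - /data/1/cassandra
    - /data/2/cassandra
    - /data/3/cassandra
    - /data/4/cassandra
"""

print(remove_data_directory(example, "/data/3/cassandra"))
```

The node would still need to be stopped before the edit and restarted
and repaired afterwards, as in the steps listed above.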
On 1/17/2023 10:11 AM, Durity, Sean R via user wrote:
>
> For physical hardware when disks fail, I do a removenode, wait for the
> drive to be replaced, reinstall Cassandra, and then bootstrap the node
> back in (and run clean-up across the DC).
>
> All of our disks are presented as one file system for data, which is
> not what the original question was asking.
>
> Sean R. Durity
>
> *From:*Marc Hoppins <ma...@eset.com>
> *Sent:* Tuesday, January 17, 2023 3:57 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] RE: Failed disks - correct procedure
>
> HI all, I was pondering this very situation. We have a node with a
> crapped-out disk (not the first time). Removenode vs repairnode: in
> regard time, there is going to be little difference twixt replacing a
> dead node and removing then re-installing
>
> INTERNAL USE
>
> HI all,
> I was pondering this very situation.
> We have a node with a crapped-out disk (not the first time).
> Removenode vs repairnode: in regard time, there is going to be little
> difference twixt replacing a dead node and removing then re-installing
> a node. There is going to be a bunch of reads/writes and
> verifications (or similar) which is going to take a similar amount of
> time...or do I read that wrong?
> For myself, I just go with removenode and then rejoin after HDD has
> bee replaced. Usually the fix exceeds the wait time and the node is
> then out of the system anyway.
> -----Original Message-----
> From: Joe Obernberger <jo...@gmail.com>
> Sent: Monday, January 16, 2023 6:31 PM
> To: Jeff Jirsa <jj...@gmail.com>; user@cassandra.apache.org
> Subject: Re: Failed disks - correct procedure
> EXTERNAL
> I'm using 4.1.0-1.
> I've been doing a lot of truncates lately before the drive failed
> (research project). Current drives have about 100GBytes of data each,
> although the actual amount of data in Cassandra is much less (because
> of truncates and snapshots). The cluster is not homo-genius; some
> nodes have more drives than others.
> nodetool status -r
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> -- Address Load Tokens Owns Host
> ID Rack
> UN nyx.querymasters.com 7.9 GiB 250 ?
> 07bccfce-45f1-41a3-a5c4-ee748a7a9b98 rack1
> UN enceladus.querymasters.com 6.34 GiB 200 ?
> 274a6e8d-de37-4e0b-b000-02d221d858a5 rack1
> UN aion.querymasters.com 6.31 GiB 200 ?
> 59150c47-274a-46fb-9d5e-bed468d36797 rack1
> UN calypso.querymasters.com 6.26 GiB 200 ?
> e83aa851-69b4-478f-88f6-60e657ea6539 rack1
> UN fortuna.querymasters.com 7.1 GiB 200 ?
> 49e4f571-7d1c-4e1e-aca7-5bbe076596f7 rack1
> UN kratos.querymasters.com 6.36 GiB 200 ?
> 0d9509cc-2f23-4117-a883-469a1be54baf rack1
> UN charon.querymasters.com 6.35 GiB 200 ?
> d9702f96-256e-45ae-8e12-69a42712be50 rack1
> UN eros.querymasters.com 6.4 GiB 200 ?
> 93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47 rack1
> UN ursula.querymasters.com 6.24 GiB 200 ?
> 4bbbe57c-6219-41e5-bbac-de92a9594d53 rack1
> UN gaia.querymasters.com 6.28 GiB 200 ?
> b2e5366e-8386-40ec-a641-27944a5a7cfa rack1
> UN chaos.querymasters.com 3.78 GiB 120 ?
> 08a19658-40be-4e55-8709-812b3d4ac750 rack1
> UN pallas.querymasters.com 6.24 GiB 200 ?
> b74b6e65-af63-486a-b07f-9e304ec30a39 rack1
> UN paradigm7.querymasters.com 16.25 GiB 500 ?
> 1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297 rack1
> UN aether.querymasters.com 6.36 GiB 200 ?
> 352fd049-32f8-4be8-9275-68b145ac2832 rack1
> UN athena.querymasters.com 15.85 GiB 500 ?
> b088a8e6-42f3-4331-a583-47ef5149598f rack1
> -Joe
> On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> > Prior to cassandra-6696 you’d have to treat one missing disk as a
> > failed machine, wipe all the data and re-stream it, as a tombstone for
> > a given value may be on one disk and data on another (effectively
> > redirecting data)
> >
> > So the answer has to be version dependent, too - which version were you using?
> >
> >> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
> >>
> >> Hi Joe,
> >>
> >> Reading it back I realized I misunderstood that part of your email,
> >> so you must be using data_file_directories with 16 drives? That's a
> >> lot of drives! I imagine this may happen from time to time given
> >> that disks like to fail.
> >>
> >> That's a bit of an interesting scenario that I would have to think
> >> about. If you brought the node up without the bad drive, repairs are
> >> probably going to do a ton of repair overstreaming if you aren't
> >> using
> >> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
> >> put things into a really bad state (lots of streaming = lots of
> >> compactions = slower reads) and you may be seeing some inconsistency
> >> if repairs weren't regularly running beforehand.
> >>
> >> How much data was on the drive that failed? How much data do you
> >> usually have per node?
> >>
> >> Thanks,
> >> Andy
> >>
> >>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
> >>> <jo...@gmail.com> wrote:
> >>>
> >>> Thank you Andy.
> >>> Is there a way to just remove the drive from the cluster and replace
> >>> it later? Ordering replacement drives isn't a fast process...
> >>> What I've done so far is:
> >>> Stop node
> >>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
> >>> Restart node
> >>> Run repair
> >>>
> >>> Will that work? Right now, it's showing all nodes as up.
> >>>
> >>> -Joe
> >>>
> >>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
> >>>> Hi Joe,
> >>>>
> >>>> I'd recommend just doing a replacement, bringing up a new node with
> >>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
> >>>> described here:
> >>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
> >>>>
> >>>> Before you do that, you will want to make sure a cycle of repairs
> >>>> has run on the replicas of the down node to ensure they are
> >>>> consistent with each other.
> >>>>
> >>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the
> >>>> node you are replacing and that the initial_token matches the node
> >>>> you are replacing (If you are not using vnodes) so the node doesn't
> >>>> skip bootstrapping. This is the default, but felt worth mentioning.
> >>>>
> >>>> You can also remove the dead node, which should stream data to
> >>>> replicas that will pick up new ranges, but you also will want to do
> >>>> repairs ahead of time too. To be honest it's not something I've
> >>>> done recently, so I'm not as confident on executing that procedure.
> >>>>
> >>>> Thanks,
> >>>> Andy
> >>>>
> >>>>
> >>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
> >>>> <jo...@gmail.com> wrote:
> >>>>> Hi all - what is the correct procedure when handling a failed disk?
> >>>>> Have a node in a 15 node cluster. This node has 16 drives and
> >>>>> cassandra data is split across them. One drive is failing. Can I
> >>>>> just remove it from the list and cassandra will then replicate? If not - what?
> >>>>> Thank you!
> >>>>>
> >>>>> -Joe
> >>>>>
> >>>>>
RE: Failed disks - correct procedure
Posted by "Durity, Sean R via user" <us...@cassandra.apache.org>.
For physical hardware when disks fail, I do a removenode, wait for the drive to be replaced, reinstall Cassandra, and then bootstrap the node back in (and run clean-up across the DC).
All of our disks are presented as one file system for data, which is not what the original question was asking.
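The sequence above can be sketched as an ordered command plan. This is only an illustration, not a vetted runbook: the helper name `removenode_plan`, the host ID, and the host names are placeholders invented for the example. `nodetool removenode` and `nodetool cleanup` are the only real commands in it; the reinstall and bootstrap steps are left as comments because they depend on how Cassandra is installed.

```python
from typing import List


def removenode_plan(dead_host_id: str, dc_hosts: List[str]) -> List[str]:
    """Return the steps for the removenode / re-bootstrap / cleanup flow.

    Nothing is executed; the list is meant to be reviewed by a person.
    """
    plan = [
        # Run from any live node; surviving replicas take over the dead node's ranges.
        f"nodetool removenode {dead_host_id}",
        "# replace the drive, reinstall Cassandra with empty data directories",
        "# start the node with auto_bootstrap enabled so it streams its data back",
    ]
    # Once the node has rejoined, other nodes still hold data for ranges they
    # no longer own; cleanup drops it. Run it on one node at a time.
    plan += [f"nodetool -h {host} cleanup" for host in dc_hosts]
    return plan


if __name__ == "__main__":
    for step in removenode_plan("<host-id-of-dead-node>", ["node1", "node2"]):
        print(step)
```

The host ID to pass to removenode is the one shown in the Host ID column of `nodetool status`. Running cleanup with `-h` assumes remote JMX access is enabled; otherwise it has to be run locally on each node.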
Sean R. Durity
From: Marc Hoppins <ma...@eset.com>
Sent: Tuesday, January 17, 2023 3:57 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] RE: Failed disks - correct procedure
Hi all,
I was pondering this very situation.
We have a node with a crapped-out disk (not the first time). Removenode vs repairnode: in regard to time, there is going to be little difference twixt replacing a dead node and removing then re-installing a node. There is going to be a bunch of reads/writes and verifications (or similar) which is going to take a similar amount of time...or do I read that wrong?
For myself, I just go with removenode and then rejoin after the HDD has been replaced. Usually the fix exceeds the wait time and the node is then out of the system anyway.
-----Original Message-----
From: Joe Obernberger <jo...@gmail.com>
Sent: Monday, January 16, 2023 6:31 PM
To: Jeff Jirsa <jj...@gmail.com>; user@cassandra.apache.org
Subject: Re: Failed disks - correct procedure
I'm using 4.1.0-1.
I've been doing a lot of truncates lately before the drive failed (research project). Current drives have about 100 GBytes of data each, although the actual amount of data in Cassandra is much less (because of truncates and snapshots). The cluster is not homogeneous; some nodes have more drives than others.
nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
-Joe
On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> Prior to cassandra-6696 you’d have to treat one missing disk as a
> failed machine, wipe all the data and re-stream it, as a tombstone for
> a given value may be on one disk and data on another (effectively
> redirecting data)
>
> So the answer has to be version dependent, too - which version were you using?
>
>> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>>
>> Hi Joe,
>>
>> Reading it back I realized I misunderstood that part of your email,
>> so you must be using data_file_directories with 16 drives? That's a
>> lot of drives! I imagine this may happen from time to time given
>> that disks like to fail.
>>
>> That's a bit of an interesting scenario that I would have to think
>> about. If you brought the node up without the bad drive, repairs are
>> probably going to do a ton of repair overstreaming if you aren't
>> using
>> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
>> put things into a really bad state (lots of streaming = lots of
>> compactions = slower reads) and you may be seeing some inconsistency
>> if repairs weren't regularly running beforehand.
>>
>> How much data was on the drive that failed? How much data do you
>> usually have per node?
>>
>> Thanks,
>> Andy
>>
>>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>>> <jo...@gmail.com> wrote:
>>>
>>> Thank you Andy.
>>> Is there a way to just remove the drive from the cluster and replace
>>> it later? Ordering replacement drives isn't a fast process...
>>> What I've done so far is:
>>> Stop node
>>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>>> Restart node
>>> Run repair
>>>
>>> Will that work? Right now, it's showing all nodes as up.
>>>
>>> -Joe
>>>
>>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>>> Hi Joe,
>>>>
>>>> I'd recommend just doing a replacement, bringing up a new node with
>>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>>> described here:
>>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>>>
>>>> Before you do that, you will want to make sure a cycle of repairs
>>>> has run on the replicas of the down node to ensure they are
>>>> consistent with each other.
>>>>
>>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the
>>>> node you are replacing and that the initial_token matches the node
>>>> you are replacing (If you are not using vnodes) so the node doesn't
>>>> skip bootstrapping. This is the default, but felt worth mentioning.
>>>>
>>>> You can also remove the dead node, which should stream data to
>>>> replicas that will pick up new ranges, but you also will want to do
>>>> repairs ahead of time too. To be honest it's not something I've
>>>> done recently, so I'm not as confident on executing that procedure.
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>>> <jo...@gmail.com> wrote:
>>>>> Hi all - what is the correct procedure when handling a failed disk?
>>>>> Have a node in a 15 node cluster. This node has 16 drives and
>>>>> cassandra data is split across them. One drive is failing. Can I
>>>>> just remove it from the list and cassandra will then replicate? If not - what?
>>>>> Thank you!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
RE: Failed disks - correct procedure
Posted by Marc Hoppins <ma...@eset.com>.
Hi all,
I was pondering this very situation.
We have a node with a crapped-out disk (not the first time). Removenode vs repairnode: in regard to time, there is going to be little difference twixt replacing a dead node and removing then re-installing a node. There is going to be a bunch of reads/writes and verifications (or similar) which is going to take a similar amount of time...or do I read that wrong?
For myself, I just go with removenode and then rejoin after the HDD has been replaced. Usually the fix exceeds the wait time and the node is then out of the system anyway.
-----Original Message-----
From: Joe Obernberger <jo...@gmail.com>
Sent: Monday, January 16, 2023 6:31 PM
To: Jeff Jirsa <jj...@gmail.com>; user@cassandra.apache.org
Subject: Re: Failed disks - correct procedure
I'm using 4.1.0-1.
I've been doing a lot of truncates lately before the drive failed (research project). Current drives have about 100 GBytes of data each, although the actual amount of data in Cassandra is much less (because of truncates and snapshots). The cluster is not homogeneous; some nodes have more drives than others.
nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
-Joe
On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> Prior to cassandra-6696 you’d have to treat one missing disk as a
> failed machine, wipe all the data and re-stream it, as a tombstone for
> a given value may be on one disk and data on another (effectively
> redirecting data)
>
> So the answer has to be version dependent, too - which version were you using?
>
>> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>>
>> Hi Joe,
>>
>> Reading it back I realized I misunderstood that part of your email,
>> so you must be using data_file_directories with 16 drives? That's a
>> lot of drives! I imagine this may happen from time to time given
>> that disks like to fail.
>>
>> That's a bit of an interesting scenario that I would have to think
>> about. If you brought the node up without the bad drive, repairs are
>> probably going to do a ton of repair overstreaming if you aren't
>> using
>> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
>> put things into a really bad state (lots of streaming = lots of
>> compactions = slower reads) and you may be seeing some inconsistency
>> if repairs weren't regularly running beforehand.
>>
>> How much data was on the drive that failed? How much data do you
>> usually have per node?
>>
>> Thanks,
>> Andy
>>
>>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>>> <jo...@gmail.com> wrote:
>>>
>>> Thank you Andy.
>>> Is there a way to just remove the drive from the cluster and replace
>>> it later? Ordering replacement drives isn't a fast process...
>>> What I've done so far is:
>>> Stop node
>>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>>> Restart node
>>> Run repair
>>>
>>> Will that work? Right now, it's showing all nodes as up.
>>>
>>> -Joe
>>>
>>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>>> Hi Joe,
>>>>
>>>> I'd recommend just doing a replacement, bringing up a new node with
>>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>>> described here:
>>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>>>
>>>> Before you do that, you will want to make sure a cycle of repairs
>>>> has run on the replicas of the down node to ensure they are
>>>> consistent with each other.
>>>>
>>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the
>>>> node you are replacing and that the initial_token matches the node
>>>> you are replacing (If you are not using vnodes) so the node doesn't
>>>> skip bootstrapping. This is the default, but felt worth mentioning.
>>>>
>>>> You can also remove the dead node, which should stream data to
>>>> replicas that will pick up new ranges, but you also will want to do
>>>> repairs ahead of time too. To be honest it's not something I've
>>>> done recently, so I'm not as confident on executing that procedure.
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>>> <jo...@gmail.com> wrote:
>>>>> Hi all - what is the correct procedure when handling a failed disk?
>>>>> Have a node in a 15 node cluster. This node has 16 drives and
>>>>> cassandra data is split across them. One drive is failing. Can I
>>>>> just remove it from the list and cassandra will then replicate? If not - what?
>>>>> Thank you!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
Re: Failed disks - correct procedure
Posted by Joe Obernberger <jo...@gmail.com>.
I'm using 4.1.0-1.
I've been doing a lot of truncates lately before the drive failed
(research project). Current drives have about 100GBytes of data each,
although the actual amount of data in Cassandra is much less (because of
truncates and snapshots). The cluster is not homogeneous; some nodes
have more drives than others.
nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                     Load       Tokens  Owns  Host ID                               Rack
UN  nyx.querymasters.com        7.9 GiB    250     ?     07bccfce-45f1-41a3-a5c4-ee748a7a9b98  rack1
UN  enceladus.querymasters.com  6.34 GiB   200     ?     274a6e8d-de37-4e0b-b000-02d221d858a5  rack1
UN  aion.querymasters.com       6.31 GiB   200     ?     59150c47-274a-46fb-9d5e-bed468d36797  rack1
UN  calypso.querymasters.com    6.26 GiB   200     ?     e83aa851-69b4-478f-88f6-60e657ea6539  rack1
UN  fortuna.querymasters.com    7.1 GiB    200     ?     49e4f571-7d1c-4e1e-aca7-5bbe076596f7  rack1
UN  kratos.querymasters.com     6.36 GiB   200     ?     0d9509cc-2f23-4117-a883-469a1be54baf  rack1
UN  charon.querymasters.com     6.35 GiB   200     ?     d9702f96-256e-45ae-8e12-69a42712be50  rack1
UN  eros.querymasters.com       6.4 GiB    200     ?     93f9cb0f-ea71-4e3d-b62a-f0ea0e888c47  rack1
UN  ursula.querymasters.com     6.24 GiB   200     ?     4bbbe57c-6219-41e5-bbac-de92a9594d53  rack1
UN  gaia.querymasters.com       6.28 GiB   200     ?     b2e5366e-8386-40ec-a641-27944a5a7cfa  rack1
UN  chaos.querymasters.com      3.78 GiB   120     ?     08a19658-40be-4e55-8709-812b3d4ac750  rack1
UN  pallas.querymasters.com     6.24 GiB   200     ?     b74b6e65-af63-486a-b07f-9e304ec30a39  rack1
UN  paradigm7.querymasters.com  16.25 GiB  500     ?     1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297  rack1
UN  aether.querymasters.com     6.36 GiB   200     ?     352fd049-32f8-4be8-9275-68b145ac2832  rack1
UN  athena.querymasters.com     15.85 GiB  500     ?     b088a8e6-42f3-4331-a583-47ef5149598f  rack1
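One way to read the status output above is to divide each node's load by its token count, which suggests whether the uneven loads simply follow the uneven token assignments. The following is a rough sketch under assumptions: the function name `load_per_token` and the regular expression are made up for this example, and the pattern only targets the column layout shown here (it tolerates rows wrapped across two lines).

```python
import re
from typing import Dict

# State, address, load value, load unit, tokens, owns (ignored), host ID, rack.
ROW = re.compile(
    r"\b([UD][NLJM])\s+(\S+)\s+([\d.]+)\s+(KiB|MiB|GiB|TiB)\s+(\d+)\s+\S+\s+"
    r"([0-9a-f-]{36})\s+(\S+)"
)
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0, "TiB": 1024.0 * 1024}


def load_per_token(status_text: str) -> Dict[str, float]:
    """Map each host in `nodetool status` output to its load in MiB per token."""
    result = {}
    for _state, host, load, unit, tokens, _host_id, _rack in ROW.findall(status_text):
        result[host] = float(load) * UNIT_TO_MIB[unit] / int(tokens)
    return result


if __name__ == "__main__":
    sample = """UN nyx.querymasters.com 7.9 GiB 250 ?
07bccfce-45f1-41a3-a5c4-ee748a7a9b98 rack1
UN chaos.querymasters.com 3.78 GiB 120 ?
08a19658-40be-4e55-8709-812b3d4ac750 rack1
UN paradigm7.querymasters.com 16.25 GiB 500 ?
1ccd2cc5-3ee5-43c5-a8c3-7065bdc24297 rack1"""
    for host, mib in load_per_token(sample).items():
        print(f"{host}: {mib:.1f} MiB per token")
```

For the three sample rows the ratio comes out at roughly 32 to 33 MiB per token, so on these numbers load appears to track token count rather than point at a hot node.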
-Joe
On 1/16/2023 12:23 PM, Jeff Jirsa wrote:
> Prior to cassandra-6696 you’d have to treat one missing disk as a failed machine, wipe all the data and re-stream it, as a tombstone for a given value may be on one disk and data on another (effectively redirecting data)
>
> So the answer has to be version dependent, too - which version were you using?
>
>> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>>
>> Hi Joe,
>>
>> Reading it back I realized I misunderstood that part of your email, so
>> you must be using data_file_directories with 16 drives? That's a lot
>> of drives! I imagine this may happen from time to time given that
>> disks like to fail.
>>
>> That's a bit of an interesting scenario that I would have to think
>> about. If you brought the node up without the bad drive, repairs are
>> probably going to do a ton of repair overstreaming if you aren't using
>> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
>> put things into a really bad state (lots of streaming = lots of
>> compactions = slower reads) and you may be seeing some inconsistency
>> if repairs weren't regularly running beforehand.
>>
>> How much data was on the drive that failed? How much data do you
>> usually have per node?
>>
>> Thanks,
>> Andy
>>
>>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>>> <jo...@gmail.com> wrote:
>>>
>>> Thank you Andy.
>>> Is there a way to just remove the drive from the cluster and replace it
>>> later? Ordering replacement drives isn't a fast process...
>>> What I've done so far is:
>>> Stop node
>>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>>> Restart node
>>> Run repair
>>>
>>> Will that work? Right now, it's showing all nodes as up.
>>>
>>> -Joe
>>>
>>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>>> Hi Joe,
>>>>
>>>> I'd recommend just doing a replacement, bringing up a new node with
>>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>>> described here:
>>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>>>
>>>> Before you do that, you will want to make sure a cycle of repairs has
>>>> run on the replicas of the down node to ensure they are consistent
>>>> with each other.
>>>>
>>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the node
>>>> you are replacing and that the initial_token matches the node you are
>>>> replacing (If you are not using vnodes) so the node doesn't skip
>>>> bootstrapping. This is the default, but felt worth mentioning.
>>>>
>>>> You can also remove the dead node, which should stream data to
>>>> replicas that will pick up new ranges, but you also will want to do
>>>> repairs ahead of time too. To be honest it's not something I've done
>>>> recently, so I'm not as confident on executing that procedure.
>>>>
>>>> Thanks,
>>>> Andy
>>>>
>>>>
>>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>>> <jo...@gmail.com> wrote:
>>>>> Hi all - what is the correct procedure when handling a failed disk?
>>>>> Have a node in a 15 node cluster. This node has 16 drives and cassandra
>>>>> data is split across them. One drive is failing. Can I just remove it
>>>>> from the list and cassandra will then replicate? If not - what?
>>>>> Thank you!
>>>>>
>>>>> -Joe
>>>>>
>>>>>
Re: Failed disks - correct procedure
Posted by Jeff Jirsa <jj...@gmail.com>.
Prior to CASSANDRA-6696 you’d have to treat one missing disk as a failed machine, wipe all the data and re-stream it, as a tombstone for a given value may be on one disk and data on another (effectively resurrecting data).
So the answer has to be version dependent, too - which version were you using?
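A toy model of the tombstone problem described above, to make the failure mode concrete. This is not how Cassandra's storage engine is implemented; the `Disk` type, the `read` function, and the key names are invented for the illustration. Each disk holds timestamped versions of a key, and a value of None stands for a tombstone.

```python
from typing import Dict, List, Optional, Tuple

# Each "disk" maps key -> list of (timestamp, value); value None is a tombstone.
Disk = Dict[str, List[Tuple[int, Optional[str]]]]


def read(key: str, disks: List[Disk]) -> Optional[str]:
    """Return the newest version of `key` across the surviving disks."""
    versions = [v for disk in disks for v in disk.get(key, [])]
    if not versions:
        return None
    _ts, value = max(versions, key=lambda v: v[0])
    return value  # None means the key is deleted


if __name__ == "__main__":
    disk_a: Disk = {"user:42": [(100, "alice")]}  # original write
    disk_b: Disk = {"user:42": [(200, None)]}     # later delete (tombstone)

    # With both disks present, the newer tombstone wins and the key reads as deleted.
    print(read("user:42", [disk_a, disk_b]))
    # If disk_b is lost, only the old write is left and the deleted value comes back.
    print(read("user:42", [disk_a]))
```

When the data and its tombstone can land on different disks, losing only the tombstone's disk brings the deleted value back on that node, which is why the whole node had to be wiped and re-streamed.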
> On Jan 16, 2023, at 9:08 AM, Tolbert, Andy <x...@andrewtolbert.com> wrote:
>
> Hi Joe,
>
> Reading it back I realized I misunderstood that part of your email, so
> you must be using data_file_directories with 16 drives? That's a lot
> of drives! I imagine this may happen from time to time given that
> disks like to fail.
>
> That's a bit of an interesting scenario that I would have to think
> about. If you brought the node up without the bad drive, repairs are
> probably going to do a ton of repair overstreaming if you aren't using
> 4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
> put things into a really bad state (lots of streaming = lots of
> compactions = slower reads) and you may be seeing some inconsistency
> if repairs weren't regularly running beforehand.
>
> How much data was on the drive that failed? How much data do you
> usually have per node?
>
> Thanks,
> Andy
>
>> On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
>> <jo...@gmail.com> wrote:
>>
>> Thank you Andy.
>> Is there a way to just remove the drive from the cluster and replace it
>> later? Ordering replacement drives isn't a fast process...
>> What I've done so far is:
>> Stop node
>> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
>> Restart node
>> Run repair
>>
>> Will that work? Right now, it's showing all nodes as up.
>>
>> -Joe
>>
>>> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
>>> Hi Joe,
>>>
>>> I'd recommend just doing a replacement, bringing up a new node with
>>> -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
>>> described here:
>>> https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
>>>
>>> Before you do that, you will want to make sure a cycle of repairs has
>>> run on the replicas of the down node to ensure they are consistent
>>> with each other.
>>>
>>> Make sure you also have 'auto_bootstrap: true' in the yaml of the node
>>> you are replacing and that the initial_token matches the node you are
>>> replacing (If you are not using vnodes) so the node doesn't skip
>>> bootstrapping. This is the default, but felt worth mentioning.
>>>
>>> You can also remove the dead node, which should stream data to
>>> replicas that will pick up new ranges, but you also will want to do
>>> repairs ahead of time too. To be honest it's not something I've done
>>> recently, so I'm not as confident on executing that procedure.
>>>
>>> Thanks,
>>> Andy
>>>
>>>
>>> On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
>>> <jo...@gmail.com> wrote:
>>>> Hi all - what is the correct procedure when handling a failed disk?
>>>> Have a node in a 15 node cluster. This node has 16 drives and cassandra
>>>> data is split across them. One drive is failing. Can I just remove it
>>>> from the list and cassandra will then replicate? If not - what?
>>>> Thank you!
>>>>
>>>> -Joe
>>>>
>>>>
Re: Failed disks - correct procedure
Posted by "Tolbert, Andy" <x...@andrewtolbert.com>.
Hi Joe,
Reading it back I realized I misunderstood that part of your email, so
you must be using data_file_directories with 16 drives? That's a lot
of drives! I imagine this may happen from time to time given that
disks like to fail.
That's a bit of an interesting scenario that I would have to think
about. If you brought the node up without the bad drive, repairs are
probably going to do a ton of repair overstreaming if you aren't using
4.0 (https://issues.apache.org/jira/browse/CASSANDRA-3200) which may
put things into a really bad state (lots of streaming = lots of
compactions = slower reads) and you may be seeing some inconsistency
if repairs weren't regularly running beforehand.
How much data was on the drive that failed? How much data do you
usually have per node?
Thanks,
Andy
On Mon, Jan 16, 2023 at 10:59 AM Joe Obernberger
<jo...@gmail.com> wrote:
>
> Thank you Andy.
> Is there a way to just remove the drive from the cluster and replace it
> later? Ordering replacement drives isn't a fast process...
> What I've done so far is:
> Stop node
> Remove drive reference from /etc/cassandra/conf/cassandra.yaml
> Restart node
> Run repair
>
> Will that work? Right now, it's showing all nodes as up.
>
> -Joe
>
> On 1/16/2023 11:55 AM, Tolbert, Andy wrote:
> > Hi Joe,
> >
> > I'd recommend just doing a replacement, bringing up a new node with
> > -Dcassandra.replace_address_first_boot=ip.you.are.replacing as
> > described here:
> > https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
> >
> > Before you do that, you will want to make sure a cycle of repairs has
> > run on the replicas of the down node to ensure they are consistent
> > with each other.
> >
> > Make sure you also have 'auto_bootstrap: true' in the yaml of the node
> > you are replacing and that the initial_token matches the node you are
> > replacing (If you are not using vnodes) so the node doesn't skip
> > bootstrapping. This is the default, but felt worth mentioning.
> >
> > You can also remove the dead node, which should stream data to
> > replicas that will pick up new ranges, but you also will want to do
> > repairs ahead of time too. To be honest it's not something I've done
> > recently, so I'm not as confident on executing that procedure.
> >
> > Thanks,
> > Andy
> >
> >
> > On Mon, Jan 16, 2023 at 9:28 AM Joe Obernberger
> > <jo...@gmail.com> wrote:
> >> Hi all - what is the correct procedure when handling a failed disk?
> >> Have a node in a 15 node cluster. This node has 16 drives and cassandra
> >> data is split across them. One drive is failing. Can I just remove it
> >> from the list and cassandra will then replicate? If not - what?
> >> Thank you!
> >>
> >> -Joe
> >>
> >>
Re: Failed disks - correct procedure
Posted by Joe Obernberger <jo...@gmail.com>.
Thank you Andy.
Is there a way to just remove the drive from the cluster and replace it
later? Ordering replacement drives isn't a fast process...
What I've done so far is:
Stop node
Remove drive reference from /etc/cassandra/conf/cassandra.yaml
Restart node
Run repair
Will that work? Right now, it's showing all nodes as up.
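For reference, the "remove drive reference" step amounts to deleting one entry from the data_file_directories list in cassandra.yaml. A minimal sketch of that edit follows, assuming the list uses the usual one-entry-per-line "- path" layout. The function name `drop_data_dir` and the example paths are hypothetical, and the sketch only rewrites text; it says nothing about whether the node will start cleanly afterwards.

```python
from typing import List


def drop_data_dir(yaml_text: str, failed_dir: str) -> str:
    """Remove one entry from the data_file_directories list in cassandra.yaml text."""
    out: List[str] = []
    in_block = False
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("data_file_directories:"):
            in_block = True
        elif in_block and stripped.startswith("- "):
            if stripped[2:].strip() == failed_dir:
                continue  # skip the failed drive's entry
        elif stripped and not stripped.startswith("#"):
            in_block = False  # any other key ends the list
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    conf = """cluster_name: 'Test'
data_file_directories:
    - /data/1/cassandra
    - /data/2/cassandra
    - /data/3/cassandra
commitlog_directory: /var/lib/cassandra/commitlog
"""
    print(drop_data_dir(conf, "/data/2/cassandra"))
```

Entries with the same path under other keys are left alone, since only lines inside the data_file_directories block are considered.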
-Joe
Re: Failed disks - correct procedure
Posted by "Tolbert, Andy" <x...@andrewtolbert.com>.
Hi Joe,
I'd recommend just doing a replacement, bringing up a new node with
-Dcassandra.replace_address_first_boot=ip.you.are.replacing as
described here:
https://cassandra.apache.org/doc/4.1/cassandra/operating/topo_changes.html#replacing-a-dead-node
Before you do that, you will want to make sure a cycle of repairs has
run on the replicas of the down node to ensure they are consistent
with each other.
Make sure you also have 'auto_bootstrap: true' in the yaml of the
replacement node and, if you are not using vnodes, that its
initial_token matches the node you are replacing, so the new node
doesn't skip bootstrapping. 'auto_bootstrap: true' is the default, but
it felt worth mentioning.
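As a rough sketch of the settings mentioned above, the replacement node's configuration might look like the fragment below. The IP placeholder is the one used in this message; where the JVM flag is actually set (for example through JVM_OPTS in cassandra-env.sh) depends on the installation and is an assumption here.

```yaml
# cassandra.yaml on the replacement node (fragment)
auto_bootstrap: true   # the default; shown explicitly so it is not overridden

# Only when not using vnodes: initial_token must match the node being
# replaced. The value is deliberately left as a placeholder.
# initial_token: <token of the node being replaced>

# Not a YAML setting: the replace flag is passed to the JVM on first boot, e.g.
#   -Dcassandra.replace_address_first_boot=ip.you.are.replacing
```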
You can also remove the dead node, which should stream data to
replicas that will pick up new ranges, but you will want to run
repairs ahead of time in that case too. To be honest, it's not
something I've done recently, so I'm not as confident about executing
that procedure.
Thanks,
Andy