Posted to user@cassandra.apache.org by William R <tr...@protonmail.com.INVALID> on 2019/06/04 16:54:53 UTC

Can I cancel a decommissioning procedure??

Hi,

There was an accidental decommissioning of a node and we really need to cancel it.. is there any way? At the moment we are keeping the node down until we figure out a way to cancel it.

Thanks

Re: Can I cancel a decommissioning procedure??

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Sure, you're welcome, glad to hear it worked! =)

Thanks for letting us know/reporting this back here, it might matter for
other people as well.

C*heers!
Alain


On Wed, Jun 5, 2019 at 07:45, William R <tr...@protonmail.com> wrote:

> Eventually after the reboot the decommission was cancelled. Thanks a lot
> for the info!
>
> Cheers
>
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Tuesday, June 4, 2019 10:59 PM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
> > the issue is that the rest of the nodes in the cluster marked it as DL
> > (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!
>
> The last information other nodes had is that this node is leaving, and
> down, that's expected in this situation. When the node comes back online,
> it should come back UN and 'quickly' other nodes should ACK it.
>
> During decommission, the node itself is responsible for streaming its data
> over. Streams were stopped when the node went down, and Cassandra won't
> remove the node unless its data was streamed properly (or you force the
> node out). I don't think there is a decommission 'resume', and even less
> that it would be enabled by default.
> Thus when the node comes back, the only possible option I see is a
> 'regular' start for that node, with the others acknowledging that the node
> is up and not leaving anymore.
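
A quick way to see that state on the node itself, as an illustrative check
only (it assumes nodetool can reach the node with its default JMX settings):

    # on the decommissioning node: Mode switches from NORMAL to LEAVING
    # while data is streamed out, and shows NORMAL again after a clean restart
    nodetool netstats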
>
> The only consequence I expect (other than the node missing the latest
> data) is that other nodes might have some extra data due to the
> decommission attempts. If that's needed (streaming ran for long or the
> data has no TTL), you can consider running 'nodetool cleanup -j 2' on all
> the nodes other than the one that went down, to remove the extra data (and
> free space).
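
A sketch of what that cleanup pass could look like, assuming SSH access to
the remaining nodes (the host names here are placeholders):

    # run on every node except the one that was being decommissioned;
    # -j 2 caps how many cleanup jobs run in parallel on each node
    for host in node1 node2 node3; do
        ssh "$host" nodetool cleanup -j 2
    done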
>
>  I did restart, still waiting to come up (normally takes ~ 30 minutes)
>>
>
> 30 minutes to start the node sounds like a long time to me, but well,
> that's another topic.
>
> C*heers
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Jun 4, 2019 at 18:31, William R <tr...@protonmail.com> wrote:
>
>> Hi Alain,
>>
>> Thank you for your comforting reply :)  I did restart, still waiting to
>> come up (normally takes ~ 30 minutes), the issue is that the rest of the
>> nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda
>> stressed.. Let's see once it's up!
>>
>>
>> Sent with ProtonMail <https://protonmail.com> Secure Email.
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>> Hello William,
>>
>>> At the moment we are keeping the node down until we figure out a way to
>>> cancel that.
>>>
>>
>> Off the top of my head, a restart of the node is the way to go to cancel
>> a decommission.
>> I think you did the right thing and your safety measure is also the fix
>> here :).
>>
>> Did you try to bring it up again?
>>
>> If it's really critical, you can probably test that quickly with ccm (
>> https://github.com/riptano/ccm), tlp-cluster (
>> https://github.com/thelastpickle/tlp-cluster) or simply with any
>> existing dev/test environment if you have any available with some data.
>>
>> Good luck with that, PEBKAC issues are the worst. You can do a lot of
>> damage, it could always have been avoided, and it makes you feel terrible.
>> It doesn't sound that bad in your case though, I've seen (and done)
>> worse ¯\_(ツ)_/¯. It's hard to fight PEBKACs; we operators are
>> unpredictable :).
>> Nonetheless, to come back to something more serious, there are ways to
>> limit the number and possible scope of those, such as good practices,
>> testing and automation.
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - alain@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> On Tue, Jun 4, 2019 at 17:55, William R <tr...@protonmail.com.invalid>
>> wrote:
>>
>>> Hi,
>>>
>>> There was an accidental decommissioning of a node and we really need to
>>> cancel it.. is there any way? At the moment we are keeping the node down
>>> until we figure out a way to cancel it.
>>>
>>> Thanks
>>>
>>
>>
>

Re: Can I cancel a decommissioning procedure??

Posted by William R <tr...@protonmail.com.INVALID>.
Eventually after the reboot the decommission was cancelled. Thanks a lot for the info!

Cheers

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 4, 2019 10:59 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

>> the issue is that the rest of the nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!
>
> The last information other nodes had is that this node is leaving, and down, that's expected in this situation. When the node comes back online, it should come back UN and 'quickly' other nodes should ACK it.
>
> During decommission, the node itself is responsible for streaming its data over. Streams were stopped when the node went down, and Cassandra won't remove the node unless its data was streamed properly (or you force the node out). I don't think there is a decommission 'resume', and even less that it would be enabled by default.
> Thus when the node comes back, the only possible option I see is a 'regular' start for that node, with the others acknowledging that the node is up and not leaving anymore.
>
> The only consequence I expect (other than the node missing the latest data) is that other nodes might have some extra data due to the decommission attempts. If that's needed (streaming ran for long or the data has no TTL), you can consider running 'nodetool cleanup -j 2' on all the nodes other than the one that went down, to remove the extra data (and free space).
>
>>  I did restart, still waiting to come up (normally takes ~ 30 minutes)
>
> 30 minutes to start the node sounds like a long time to me, but well, that's another topic.
>
> C*heers
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Jun 4, 2019 at 18:31, William R <tr...@protonmail.com> wrote:
>
>> Hi Alain,
>>
>> Thank you for your comforting reply :)  I did restart, still waiting to come up (normally takes ~ 30 minutes), the issue is that the rest of the nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!
>>
>> Sent with [ProtonMail](https://protonmail.com) Secure Email.
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
>>
>>> Hello William,
>>>
>>>> At the moment we are keeping the node down until we figure out a way to cancel that.
>>>
>>> Off the top of my head, a restart of the node is the way to go to cancel a decommission.
>>> I think you did the right thing and your safety measure is also the fix here :).
>>>
>>> Did you try to bring it up again?
>>>
>>> If it's really critical, you can probably test that quickly with ccm (https://github.com/riptano/ccm), tlp-cluster (https://github.com/thelastpickle/tlp-cluster) or simply with any existing dev/test environment if you have any available with some data.
>>>
>>> Good luck with that, PEBKAC issues are the worst. You can do a lot of damage, it could always have been avoided, and it makes you feel terrible.
>>> It doesn't sound that bad in your case though, I've seen (and done) worse ¯\_(ツ)_/¯. It's hard to fight PEBKACs; we operators are unpredictable :).
>>> Nonetheless, to come back to something more serious, there are ways to limit the number and possible scope of those, such as good practices, testing and automation.
>>>
>>> C*heers,
>>> -----------------------
>>> Alain Rodriguez - alain@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> On Tue, Jun 4, 2019 at 17:55, William R <tr...@protonmail.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>> There was an accidental decommissioning of a node and we really need to cancel it.. is there any way? At the moment we are keeping the node down until we figure out a way to cancel it.
>>>>
>>>> Thanks

Re: Can I cancel a decommissioning procedure??

Posted by William R <tr...@protonmail.com.INVALID>.
Hi Alain,

Thank you for your comforting reply :)  I did restart, still waiting to come up (normally takes ~ 30 minutes), the issue is that the rest of the nodes in the cluster marked it as DL (DOWN/LEAVING), that's why I am kinda stressed.. Let's see once it's up!

Sent with [ProtonMail](https://protonmail.com) Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> Hello William,
>
>> At the moment we are keeping the node down until we figure out a way to cancel that.
>
> Off the top of my head, a restart of the node is the way to go to cancel a decommission.
> I think you did the right thing and your safety measure is also the fix here :).
>
> Did you try to bring it up again?
>
> If it's really critical, you can probably test that quickly with ccm (https://github.com/riptano/ccm), tlp-cluster (https://github.com/thelastpickle/tlp-cluster) or simply with any existing dev/test environment if you have any available with some data.
>
> Good luck with that, PEBKAC issues are the worst. You can do a lot of damage, it could always have been avoided, and it makes you feel terrible.
> It doesn't sound that bad in your case though, I've seen (and done) worse ¯\_(ツ)_/¯. It's hard to fight PEBKACs; we operators are unpredictable :).
> Nonetheless, to come back to something more serious, there are ways to limit the number and possible scope of those, such as good practices, testing and automation.
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Tue, Jun 4, 2019 at 17:55, William R <tr...@protonmail.com.invalid> wrote:
>
>> Hi,
>>
>> There was an accidental decommissioning of a node and we really need to cancel it.. is there any way? At the moment we are keeping the node down until we figure out a way to cancel it.
>>
>> Thanks

Re: Can I cancel a decommissioning procedure??

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hello William,

> At the moment we are keeping the node down until we figure out a way to
> cancel that.

Off the top of my head, a restart of the node is the way to go to cancel a
decommission.
I think you did the right thing and your safety measure is also the fix
here :).

Did you try to bring it up again?
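
To check that the cancellation took effect once the node is back, something
along these lines should do. This is only a rough sketch, assuming a
systemd-managed Cassandra service and default nodetool settings, so adapt
the service name to your setup:

    # bring the node back up (assumes a systemd unit named 'cassandra')
    sudo systemctl start cassandra

    # on the restarted node: Mode should read NORMAL again, not LEAVING
    nodetool netstats | grep Mode

    # on any other node: the restarted node should show up again as UN (Up/Normal)
    nodetool status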

If it's really critical, you can probably test that quickly with ccm (
https://github.com/riptano/ccm), tlp-cluster (
https://github.com/thelastpickle/tlp-cluster) or simply with any existing
dev/test environment if you have any available with some data.
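
For instance, a throwaway ccm cluster can reproduce the scenario in a few
commands. This is just a sketch: it assumes ccm is installed, 3.11.4 is only
an example version, and the cluster name and node picked are arbitrary:

    # 3-node local test cluster, created and started
    ccm create decom-test -v 3.11.4 -n 3 -s

    # start a decommission on node3, then stop that node from a second
    # terminal before the decommission finishes
    ccm node3 nodetool decommission
    ccm node3 stop

    # bring it back and confirm the other nodes report it as UN again
    ccm node3 start
    ccm node1 nodetool status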

Good luck with that, PEBKAC issues are the worst. You can do a lot of
damage, it could always have been avoided, and it makes you feel terrible.
It doesn't sound that bad in your case though, I've seen (and done) worse
¯\_(ツ)_/¯. It's hard to fight PEBKACs; we operators are unpredictable :).
Nonetheless, to come back to something more serious, there are ways to
limit the number and possible scope of those, such as good practices,
testing and automation.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



On Tue, Jun 4, 2019 at 17:55, William R <tr...@protonmail.com.invalid>
wrote:

> Hi,
>
> There was an accidental decommissioning of a node and we really need to
> cancel it.. is there any way? At the moment we are keeping the node down
> until we figure out a way to cancel it.
>
> Thanks
>