Posted to user@cassandra.apache.org by onmstester onmstester <on...@zoho.com.INVALID> on 2018/12/04 10:51:25 UTC

How to gracefully decommission a highly loaded node?

One node is suddenly using 100% CPU. I suspect a hardware problem and do not have time to trace it, so I decided to just remove the node from the cluster. The node state changed to UL, but there is no sign of it actually leaving: the node is still compacting, flushing memtables and writing mutations, and CPU has been at 100% for hours.

Is there any way to force a Cassandra node to just decommission and stop doing its normal work? Because writes use consistency level ONE (W.CL=ONE), I cannot use removenode and shut down the node.

Best Regards

Sent using Zoho Mail

Re: Re: How to gracefully decommission a highly loaded node?

Posted by Oleksandr Shulgin <ol...@zalando.de>.

On Mon, Dec 17, 2018 at 11:44 AM Riccardo Ferrari <fe...@gmail.com>
wrote:

> I am having "the same" issue.
> One of my nodes seems to be struggling with its hardware: out of 6 nodes
> (same instance size), this one is likely to be marked down, it is constantly
> compacting, and it has high system load. It's just a big pain.
>
> My idea was to add nodes and decommission all the ones running on old
> hardware (m1.xlarge). However, this specific "bad" node is causing trouble
> for the whole cluster, so I decided to decommission it first.
>
> The node is simply stuck in "LEAVING" - Not sending any stream. I have
> already disabled binary and autocompactions and tried to restart the
> decommission process a couple of times with no luck.
> Any suggestions?
> assassinate vs removenode?
> Any tuning that could help?
>

If it's stuck that badly, then I would consider it lost and just do
replacenode.  Hope it's not too late if you started to decommission?

Cheers,
--
Alex

Re: Re: How to gracefully decommission a highly loaded node?

Posted by Riccardo Ferrari <fe...@gmail.com>.

I am having "the same" issue.
One of my nodes seems to be struggling with its hardware: out of 6 nodes
(same instance size), this one is likely to be marked down, it is constantly
compacting, and it has high system load. It's just a big pain.

My idea was to add nodes and decommission all the ones running on old
hardware (m1.xlarge). However, this specific "bad" node is causing trouble
for the whole cluster, so I decided to decommission it first.

The node is simply stuck in "LEAVING" - Not sending any stream. I have
already disabled binary and autocompactions and tried to restart the
decommission process a couple of times with no luck.
Any suggestions?
assassinate vs removenode?
Any tuning that could help?
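
To tell a slow decommission from a stuck one, the output of nodetool netstats can be checked repeatedly. A minimal sketch, assuming the output contains a "Mode: <STATE>" line and the phrase "Not sending any streams" when nothing is being streamed; both are assumptions about the output format based on what is reported in this thread, so verify them against your Cassandra version. The function name and the returned keys are hypothetical.

```python
import re


def summarize_netstats(output):
    """Extract the node mode and flag a possibly stuck decommission.

    `output` is assumed to be the text printed by `nodetool netstats`.
    """
    # Assumed format: a line such as "Mode: LEAVING" somewhere in the output.
    match = re.search(r"^Mode:\s*(\S+)", output, flags=re.MULTILINE)
    mode = match.group(1) if match else None
    # Assumed phrase printed when the node has no outgoing streams.
    idle = "Not sending any streams" in output
    # LEAVING with no outgoing streams is only a hint, not proof, of a stall.
    return {"mode": mode, "stuck_candidate": mode == "LEAVING" and idle}
```

A node that keeps reporting LEAVING with no outgoing streams across several checks matches the symptom described above; a single sample is not conclusive.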

Best,

On Thu, Dec 6, 2018 at 10:59 AM onmstester onmstester
<on...@zoho.com.invalid> wrote:

> After a few hours, I just removed the node. I then decommissioned another
> node, which finished successfully (the writer app was down, so there was no
> pressure on the cluster).
> I started a third node decommission. Since I didn't have time to wait for
> it to finish, I started the writer application when most of the
> decommissioning node's streaming was done and only a few GBs remained to be
> streamed to two other nodes.
> After 12 hours I checked the decommissioning node, and netstats says:
> LEAVING, Restore Replica Count....!
> So I just ran removenode on this one too.
> Is there something wrong with decommissioning while someone is writing to
> the cluster?
> Using Apache Cassandra 3.11.2
>
> ============ Forwarded message ============
> From : onmstester onmstester <on...@zoho.com.INVALID>
> To : "user"<us...@cassandra.apache.org>
> Date : Wed, 05 Dec 2018 09:00:34 +0330
> Subject : Fwd: Re: How to gracefully decommission a highly loaded node?
> ============ Forwarded message ============
>
> After a long time stuck in LEAVING and "not doing any streams", I killed
> the Cassandra process and restarted it, then ran nodetool decommission again
> (the DataStax recipe for a stuck decommission).
> Now it says: LEAVING, "unbootstrap $(the node id)"
>
> What's going on? Should I forget about decommission and just remove the
> node?
>
> There is an issue to make decommission resumable:
> https://issues.apache.org/jira/browse/CASSANDRA-12008
>
> but I couldn't figure out how this is supposed to work. I was expecting that
> after restarting the node with the stuck decommission, it would resume the
> decommissioning process, but the node became UN after the restart.
>
> ============ Forwarded message ============
> From : Simon Fontana Oscarsson <si...@ericsson.com>
> To : "user@cassandra.apache.org"<us...@cassandra.apache.org>
> Date : Tue, 04 Dec 2018 15:20:15 +0330
> Subject : Re: How to gracefully decommission a highly loaded node?
> ============ Forwarded message ============
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
>
> Hi,
>
> If it already uses 100% CPU, I have a hard time seeing it being able to do
> a decommission while serving requests. If you have a lot of free space, I
> would first try nodetool disableautocompaction. If you don't see any
> progress in nodetool netstats, you can also run disablebinary, disablethrift
> and disablehandoff to stop serving client requests.
>
> --
>
> SIMON FONTANA OSCARSSON
> Software Developer
>
> Ericsson
> Ölandsgatan 1
> 37133 Karlskrona, Sweden
> simon.fontana.oscarsson@ericsson.com
> www.ericsson.com
>
>
> On Tue, 2018-12-04 at 14:21 +0330, onmstester onmstester wrote:
>
> One node is suddenly using 100% CPU. I suspect a hardware problem and do
> not have time to trace it, so I decided to just remove the node from the
> cluster. The node state changed to UL, but there is no sign of it actually
> leaving: the node is still compacting, flushing memtables and writing
> mutations, and CPU has been at 100% for hours.
> Is there any way to force a Cassandra node to just decommission and stop
> doing its normal work?
> Because writes use consistency level ONE (W.CL=ONE), I cannot use
> removenode and shut down the node.
>
> Best Regards

Fwd: Re: How to gracefully decommission a highly loaded node?

Posted by onmstester onmstester <on...@zoho.com.INVALID>.

After a few hours, I just removed the node. I then decommissioned another node, which finished successfully (the writer app was down, so there was no pressure on the cluster).

I started a third node decommission. Since I didn't have time to wait for it to finish, I started the writer application when most of the decommissioning node's streaming was done and only a few GBs remained to be streamed to two other nodes. After 12 hours I checked the decommissioning node, and netstats says: LEAVING, Restore Replica Count....! So I just ran removenode on this one too.

Is there something wrong with decommissioning while someone is writing to the cluster?

Using Apache Cassandra 3.11.2

Fwd: Re: How to gracefully decommission a highly loaded node?

Posted by onmstester onmstester <on...@zoho.com.INVALID>.

After a long time stuck in LEAVING and "not doing any streams", I killed the Cassandra process and restarted it, then ran nodetool decommission again (the DataStax recipe for a stuck decommission). Now it says: LEAVING, "unbootstrap $(the node id)"

What's going on? Should I forget about decommission and just remove the node?

There is an issue to make decommission resumable: https://issues.apache.org/jira/browse/CASSANDRA-12008

but I couldn't figure out how this is supposed to work. I was expecting that after restarting the node with the stuck decommission, it would resume the decommissioning process, but the node became UN after the restart.

Re: How to gracefully decommission a highly loaded node?

Posted by Simon Fontana Oscarsson <si...@ericsson.com>.

Hi,
If it already uses 100% CPU, I have a hard time seeing it being able to
do a decommission while serving requests. If you have a lot of free
space, I would first try nodetool disableautocompaction. If you don't
see any progress in nodetool netstats, you can also run disablebinary,
disablethrift and disablehandoff to stop serving client requests.
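
The steps above can be sketched as a small Python wrapper around nodetool. This is only an illustration: the function names, the dry-run default and the exact ordering are assumptions, not something taken from this thread or from Cassandra itself. The nodetool subcommands are the ones named above; disablethrift only applies to versions that still ship Thrift, such as the 3.11.x line.

```python
import subprocess

# Subcommands named in the advice above, in the order they are suggested.
QUIESCE_STEPS = [
    "disableautocompaction",  # stop automatic compactions
    "disablebinary",          # stop the native (CQL) transport
    "disablethrift",          # stop the Thrift server (older versions only)
    "disablehandoff",         # stop storing hinted handoffs
]


def quiesce_commands(nodetool="nodetool"):
    """Return the nodetool invocations as argument lists, without running them."""
    return [[nodetool, step] for step in QUIESCE_STEPS]


def quiesce(run=False, nodetool="nodetool"):
    """Print each command; execute it only when run=True (dry run by default)."""
    for cmd in quiesce_commands(nodetool):
        print(" ".join(cmd))
        if run:
            # check=True raises if nodetool exits with a non-zero status.
            subprocess.run(cmd, check=True)
```

Calling quiesce() only prints the four commands for review; quiesce(run=True) executes them on the local node, after which nodetool netstats can be used to see whether streaming makes progress.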

-- 
SIMON FONTANA OSCARSSON
Software Developer

Ericsson
Ölandsgatan 1
37133 Karlskrona, Sweden
simon.fontana.oscarsson@ericsson.com
www.ericsson.com
On Tue, 2018-12-04 at 14:21 +0330, onmstester onmstester wrote:
> One node is suddenly using 100% CPU. I suspect a hardware problem and
> do not have time to trace it, so I decided to just remove the node
> from the cluster. The node state changed to UL, but there is no sign
> of it actually leaving: the node is still compacting, flushing
> memtables and writing mutations, and CPU has been at 100% for hours.
> Is there any way to force a Cassandra node to just decommission and
> stop doing its normal work?
> Because writes use consistency level ONE (W.CL=ONE), I cannot use
> removenode and shut down the node.
> 
> Best Regards