Posted to user@cassandra.apache.org by Oleksandr Shulgin <ol...@zalando.de> on 2018/01/19 08:42:41 UTC

Decommissioned nodes and FailureDetector

Hello,

Is there a better way to monitor for Cassandra nodes going Down than
querying via JMX for a condition like FailureDetector.DownEndpointCount > 0?

The problem for us is that when any node is decommissioned, it keeps
affecting the DownEndpointCount for another ~3 days (the famous 72 hours
of gossip).

Is there a similar metric to be observed which doesn't include nodes which
are expected to be down?

Regards,
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707

Re: Decommissioned nodes and FailureDetector

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Fri, Jan 19, 2018 at 11:17 AM, Nicolas Guyomar <nicolas.guyomar@gmail.com> wrote:

> Hi,
>
> Not sure if StorageService should be accessed, but you can check node
> movement here:
> 'org.apache.cassandra.db:type=StorageService/LeavingNodes',
> 'org.apache.cassandra.db:type=StorageService/LiveNodes',
> 'org.apache.cassandra.db:type=StorageService/UnreachableNodes',
>

Checking the list of unreachable nodes doesn't help, unfortunately, since
it contains a mix of decommissioned nodes and nodes that are simply DOWN.
So the number of addresses in this list is equal to the DownEndpointCount,
as seen from the node being queried.
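
One possible workaround, sketched here in Python, is to subtract the nodes
that are known to have been decommissioned recently before alerting. This is
only an illustration: the `decommissioned_at` registry is hypothetical (it
would have to be maintained by whoever runs the decommission), and the
72-hour window is taken from the "72 hours of gossip" figure mentioned in
this thread, not read from Cassandra itself.

```python
from datetime import datetime, timedelta, timezone

# Assumption: decommissioned nodes stay visible for about 72 hours,
# as described in this thread.
GOSSIP_RETENTION = timedelta(hours=72)


def unexpected_down(unreachable, decommissioned_at, now=None):
    """Return unreachable addresses that are not explained by a recent decommission.

    unreachable        -- iterable of addresses, e.g. the UnreachableNodes list
    decommissioned_at  -- hypothetical registry: {address: datetime of decommission}
    now                -- injectable clock for testing; defaults to current UTC time
    """
    if now is None:
        now = datetime.now(timezone.utc)
    # Only decommissions inside the retention window count as "expected down".
    expected = {
        address
        for address, when in decommissioned_at.items()
        if now - when <= GOSSIP_RETENTION
    }
    return sorted(set(unreachable) - expected)
```

An alert would then fire on `len(unexpected_down(...)) > 0` instead of on
DownEndpointCount > 0. A node that still shows up as unreachable after the
window has passed is reported again, which seems like the safer default.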

--
Alex

Re: Decommissioned nodes and FailureDetector

Posted by Nicolas Guyomar <ni...@gmail.com>.
Hi,

Not sure if StorageService should be accessed, but you can check node
movement here:
'org.apache.cassandra.db:type=StorageService/LeavingNodes',
'org.apache.cassandra.db:type=StorageService/LiveNodes',
'org.apache.cassandra.db:type=StorageService/UnreachableNodes',
'org.apache.cassandra.db:type=StorageService/MovingNodes'
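
A minimal Python sketch of reading these four attributes, assuming a Jolokia
agent (a JMX-to-HTTP bridge) is attached to the Cassandra JVM. That setup is
an assumption and is not something this thread confirms; the base URL in the
usage note is a placeholder, and the helper names are made up for the
example.

```python
import json
from urllib.request import urlopen

# MBean name and attributes as quoted in the message above.
STORAGE_SERVICE = "org.apache.cassandra.db:type=StorageService"
NODE_LIST_ATTRIBUTES = ("LeavingNodes", "LiveNodes", "UnreachableNodes", "MovingNodes")


def read_url(base_url, mbean, attribute):
    """Build a Jolokia 'read' URL for one MBean attribute."""
    return "{}/read/{}/{}".format(base_url.rstrip("/"), mbean, attribute)


def parse_value(body):
    """Extract the attribute value from a Jolokia JSON response body."""
    payload = json.loads(body)
    if payload.get("status") != 200:
        raise RuntimeError("Jolokia read failed: {}".format(payload.get("error")))
    return payload["value"]


def http_get(url):
    """Default fetcher: plain HTTP GET without authentication (an assumption)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def fetch_node_lists(base_url, fetch=http_get):
    """Return {attribute: list of addresses} for the four node-list attributes."""
    return {
        attribute: parse_value(fetch(read_url(base_url, STORAGE_SERVICE, attribute)))
        for attribute in NODE_LIST_ATTRIBUTES
    }
```

Usage would look like `fetch_node_lists("http://localhost:8778/jolokia")`,
with the host and port adjusted to wherever the agent actually listens. The
`fetch` parameter is there so the parsing can be exercised without a running
node.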
