Posted to user@cassandra.apache.org by Erik Forsberg <fo...@opera.com> on 2015/04/15 14:15:57 UTC

One node misbehaving (lots of GC), ideas?

Hi!

We're having problems with one node (out of 56 in total) misbehaving.
The symptoms are:

* High number of full CMS old space collections during early morning
when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few
thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for
both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of garbage collections *seems* to correspond to
periods with a lot of compactions (SizeTiered for most of our CFs,
Leveled for a few small ones).
* The node losing track of which other nodes are up, and keeping that
stale state until restart (I think this is caused by the GC behaviour,
with the stop-the-world pauses preventing the node from accepting
gossip connections from other nodes).
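
To put numbers on those pauses, here is a sketch (the log lines below
are made up; the real log location depends on your -Xloggc setting, and
-XX:+PrintGCApplicationStoppedTime must be enabled in cassandra-env.sh
for these "application threads were stopped" lines to appear):

```shell
# Hypothetical GC log excerpt; real lines come from
# -XX:+PrintGCApplicationStoppedTime in the JVM options.
cat > /tmp/gc-sample.log <<'EOF'
2015-04-15T04:02:11.123+0000: 1234.567: Total time for which application threads were stopped: 0.0421 seconds
2015-04-15T04:03:40.456+0000: 1323.900: Total time for which application threads were stopped: 49.8812 seconds
EOF

# Print only the stop-the-world pauses longer than one second
# (the duration is the second-to-last field on each line).
awk '/application threads were stopped/ && $(NF-1) + 0 > 1.0' /tmp/gc-sample.log
```

Counting such lines per hour on the bad node vs. a healthy one should
make the difference concrete.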

This is on 2.0.13 with vnodes (256 per node).

All other nodes behave normally, with only a few (2-3) full CMS old
space collections in the same 3-hour period in which the troubled node
does some 30. Heap size is 8G, with NEW_SIZE set to 800M. With 6G/800M
the problem was even worse (it seems; this is a bit hard to debug, as
it happens *almost* every night).
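
For completeness, the corresponding knobs in conf/cassandra-env.sh
(variable names as in the 2.0 distribution; the values below just
mirror the 8G/800M setup described above):

```shell
# conf/cassandra-env.sh (excerpt) -- values matching the setup above
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"
```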

nodetool status shows that although there is some imbalance in the
cluster, this node is neither the most nor the least loaded: the "Owns"
column ranges between 1.6% and 2.1% across the cluster, and the
troublesome node reports 1.7%.
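
One way to eyeball that ownership spread is to pull the "Owns" column
out of nodetool status; a sketch over made-up output (addresses and
host IDs are invented, and the column positions assume the 2.0-era
format where Load occupies two fields):

```shell
# Invented `nodetool status` lines: State Address Load Tokens Owns Host-ID Rack
cat > /tmp/status-sample.txt <<'EOF'
UN  10.20.1.11  498.0 GB  256  2.1%  aaaa  cssa01
UN  10.20.3.15  470.9 GB  256  1.6%  bbbb  cssa03
UN  10.20.4.21  512.3 GB  256  1.7%  cccc  cssa04
EOF

# Print "owns address" for Up/Normal nodes, most-loaded first.
awk '/^UN/ { print $6, $2 }' /tmp/status-sample.txt | sort -rn
```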

All nodes are under Puppet control, so the configuration is the same
everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a
deviation from recommended settings: we have a slightly varying number
of nodes per rack:

     15 cssa01
     15 cssa02
     13 cssa03
     13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some
kind of hotspot situation? Why would that show up as more GC work?
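
A back-of-the-envelope check of that suspicion, assuming replicas end
up spread roughly evenly across the four racks (NetworkTopologyStrategy
does not strictly guarantee this, so treat the numbers as a rough
sketch): each node in a 13-node rack would carry about 15/13 of the
data of a node in a 15-node rack.

```shell
# Per-node share of the replica load for 13- vs 15-node racks,
# assuming each of the 4 racks holds an equal quarter of the replicas.
awk 'BEGIN {
  per_rack = 1.0 / 4
  printf "per-node share, 13-node rack: %.4f\n", per_rack / 13
  printf "per-node share, 15-node rack: %.4f\n", per_rack / 15
  printf "extra load factor for the smaller rack: %.2f\n", 15.0 / 13
}'
```

A ~15% difference alone seems unlikely to explain a tenfold difference
in full GCs, but it could amplify some other problem.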

I'm quite puzzled here, so I'm looking for hints on how to identify what
is causing this.

Regards,
\EF





Re: One node misbehaving (lots of GC), ideas?

Posted by Michal Michalski <mi...@boxever.com>.
Hi Erik,

Forgetting for a while that it's only a single node: does this node
store any super-long rows?
The first things that come to my mind after reading your e-mail are
unthrottled compaction (it sounds like a possible cause, but it would
affect other nodes too) and very large rows. Or a mix of both?
Maybe this will be of interest for investigating GC issues and pinning
them down further (if you haven't seen it yet):
http://aryanet.com/blog/cassandra-garbage-collector-tuning
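
To check the large-rows theory, nodetool cfstats reports a per-CF
"Compacted row maximum size"; a quick filter over that output could
look like this (the excerpt below is made up, and the 1 GB threshold is
arbitrary):

```shell
# Made-up excerpt of `nodetool cfstats` output; real output has more fields.
cat > /tmp/cfstats-sample.txt <<'EOF'
    Column Family: events
    Compacted row maximum size: 4139110981
    Column Family: users
    Compacted row maximum size: 25109160
EOF

# Remember the current CF name; print it when its largest compacted
# row exceeds 1 GB (1073741824 bytes).
awk '/Column Family:/ { cf = $NF }
     /Compacted row maximum size:/ && $NF + 0 > 1073741824 { print cf, $NF }' \
    /tmp/cfstats-sample.txt
```

Compaction throttling can also be adjusted at runtime with
`nodetool setcompactionthroughput <MB/s>`, which would let you test the
unthrottled-compaction theory without a restart.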

M.



Kind regards,
MichaƂ Michalski,
michal.michalski@boxever.com

On 15 April 2015 at 13:15, Erik Forsberg <fo...@opera.com> wrote:

> [snip]