Posted to user@cassandra.apache.org by "C. Scott Andreas" <cs...@urbanairship.com> on 2011/04/06 21:07:22 UTC

Flush / Snapshot Triggering Full GCs, Leaving Ring

Hello,

We're running a six-node 0.7.4 ring in EC2 on m1.xlarge instances with 4GB heap (15GB total memory, 4 cores, dataset fits in RAM, storage on ephemeral disk). We've noticed a brief flurry of query failures during the night corresponding with our backup schedule. More specifically, our logs suggest that calling "nodetool snapshot" on a node is triggering 12 to 16 second CMS GCs and a promotion failure resulting in a full stop-the-world collection, during which the node is marked dead by the ring until re-joining shortly after.

Here's a log from one of the nodes, along with system info and JVM options: https://gist.github.com/e12c6cae500e118676d1

At 13:15:00, our backup cron job runs, which calls nodetool flush, then nodetool snapshot. (After investigating, we noticed that calling both flush and snapshot is unnecessary, and have since updated the script to only call snapshot). While writing memtables, we'll generally see a GC logged out via Cassandra such as:

"GC for ConcurrentMarkSweep: 16113 ms, 1755422432 reclaimed leaving 1869123536 used; max is 4424663040."

In the JVM GC logs, we'll often see a tenured promotion failure occurring during this collection, resulting in a full stop-the-world GC like this (different node):

1180629.380: [CMS1180634.414: [CMS-concurrent-mark: 6.041/6.468 secs] [Times: user=8.00 sys=0.10, real=6.46 secs] 
 (concurrent mode failure): 3904635K->1700629K(4109120K), 16.0548910 secs] 3958389K->1700629K(4185792K), [CMS Perm : 19610K->19601K(32796K)], 16.1057040 secs] [Times: user=14.39 sys=0.02, real=16.10 secs]

During the GC, the rest of the ring will shun the node, and when the collection completes, the node will mark all other hosts in the ring as dead. The node and ring stabilize shortly after once detecting each other as up and completing hinted handoff (details in log).

Yesterday we enabled JNA on one of the nodes to prevent forking a subprocess to call `ln` during a snapshot. We still observed a concurrent mode failure collection following a flush/snapshot, but the CMS was shorter (9 seconds) and did not result in the node being shunned from the ring.

While the query failures that result from this activity are brief, our retry threshold is set to 6 for timeout exceptions. We're concerned that we're exceeding that, and would like to figure out why we see long CMS collections + promotion failures triggering full GCs during a snapshot.

Has anyone seen this, or have suggestions on how to prevent full GCs from occurring during a flush / snapshot?

Thanks,

- Scott

---

C. Scott Andreas
Engineer, Urban Airship, Inc.
http://www.urbanairship.com

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by Janne Jalkanen <Ja...@ecyrd.com>.
On Apr 7, 2011, at 23:43 , Jonathan Ellis wrote:

> The history is that, way back in the early days, we used to max it out
> the other way (MTT=128) but observed behavior is that objects that
> survive 1 new gen collection are very likely to survive "forever."

Just a quick note: my own tests seem to indicate likewise - I've been running one machine in our cluster with MTT=8 for some time now, and I'm not seeing any real difference between MTT=1 and MTT=8, except for a (very) slight increase in CPU usage, consistent with the increased copying. We have a read-heavy cluster that uses RowCaches a lot, so YMMV.
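
For anyone who wants to repeat the comparison, the knob is a plain JVM option set through cassandra-env.sh. Roughly what the override looks like (a sketch; the tenuring-distribution flag is just handy for seeing at what age objects actually get promoted, and an MTT above 1 only helps if the survivor spaces are large enough to hold a few collections' worth of survivors):

    # cassandra-env.sh, on the test node only (sketch)
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=8"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"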

/Janne


Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Apr 7, 2011 at 2:27 PM, Erik Onnen <eo...@gmail.com> wrote:
> 1) Does this seem like a sane amount of garbage (512MB) to generate
> when flushing a 64MB table to disk?

Sort of -- that's just about exactly the amount of space you'd expect
64MB of serialized data to take, in memory. (Not very efficient, I
know.)  So, you would expect that much to be available to GC, after a
flush.
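
To put rough numbers on it (illustrative only -- both per-column figures below are assumptions, not measurements from your cluster):

    ~64 bytes serialized per column   ->  64MB flush / 64B  ~= 1M columns
    ~500 bytes on-heap per column     ->  1M columns x ~500B ~= 500MB live
      (object headers, byte[] copies, ConcurrentSkipListMap nodes, references)

All of that only becomes collectible once the memtable is flushed and dropped.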

Also, flush creates a buffer equal to in_memory_compaction_limit.  So
that will also generate a spike.  I think you upgraded from 0.6 -- if
the converter turned row size warning limit into i_m_c_l then it could
be much larger.
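
For reference, the 0.7 knob lives in cassandra.yaml -- 64 should be the shipped default, but it's worth checking what the converter actually wrote:

    # cassandra.yaml (sketch; verify against your converted config)
    in_memory_compaction_limit_in_mb: 64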

Otherwise, I'm not sure why flush would consume that much *extra*.
Smells like something unexpected in the flush code to me, but I don't
see anything obvious -- SSTableWriter serializes directly to the output
stream without (m)any other allocations.

> 2) Is this possibly a case of the MaxTenuringThreshold=1 working
> against cassandra? The flush seems to create a lot of garbage very
> quickly such that normal CMS isn't even possible. I'm sure there was a
> reason to introduce this setting but I'm not sure it's universally
> beneficial. Is there any history on the decision to opt for immediate
> promotion rather than using an adaptable number of survivor
> generations?

The history is that, way back in the early days, we used to max it out
the other way (MTT=128) but observed behavior is that objects that
survive 1 new gen collection are very likely to survive "forever."
This fits with what we expect theoretically: read requests and
ephemera from write requests will happen in a small number of ms, but
memtable data is not GC-able until flush. (Rowcache data of course is
effectively unbounded in tenure.)  Keeping long-lived data in a
survivor space just makes new gen collections take longer since you
are copying that data back and forth over and over.

(We have advised some read-heavy customers to ramp up to MTT=16, so
it's not a hard-and-fast rule, but it still feels like a reasonable
starting point to me.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by Erik Onnen <eo...@gmail.com>.
I'll capture what we're seeing here for anyone else who may look into
this in more detail later.

Our standard heap growth is ~300K in between collections with regular
ParNew collections happening on average about every 4 seconds. All
very healthy.

The memtable flush (where we see almost all our CMS activity) seems to
have a balloon effect: despite a 64MB memtable size, over 512MB of heap
is consumed in half a second. On top of the sheer volume of garbage,
the MaxTenuringThreshold=1 setting means most of it spills immediately
into the tenured generation, which quickly fills and triggers a CMS.
The rate of garbage overflowing into tenured seems to outstrip the
concurrent mark worker, which is almost always interrupted and ends in
a concurrent mode failure. The resulting tenured collection is usually
hugely effective, though, recovering over half the total heap.
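
(For anyone who wants to watch the same thing, plain jstat against the Cassandra PID makes the spike easy to see -- the old-gen occupancy column, O, jumps within a second or two of kicking off a flush. Something like:

    jstat -gcutil <cassandra-pid> 1000

where 1000 is the sample interval in milliseconds and the PID placeholder is whatever ps reports.)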

Two questions for the group then:

1) Does this seem like a sane amount of garbage (512MB) to generate
when flushing a 64MB table to disk?
2) Is this possibly a case of the MaxTenuringThreshold=1 working
against cassandra? The flush seems to create a lot of garbage very
quickly such that normal CMS isn't even possible. I'm sure there was a
reason to introduce this setting but I'm not sure it's universally
beneficial. Is there any history on the decision to opt for immediate
promotion rather than using an adaptable number of survivor
generations?

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by Jonathan Ellis <jb...@gmail.com>.
No, 2252 is not suitable for backporting to 0.7.

On Thu, Apr 7, 2011 at 7:33 AM, ruslan usifov <ru...@gmail.com> wrote:
> Jonathan, do you have plans to backport this to the 0.7 branch? (It's
> very hard to tune CMS, and for people new to Java the task is even harder.)



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by ruslan usifov <ru...@gmail.com>.
2011/4/7 Jonathan Ellis <jb...@gmail.com>

Jonathan, do you have plans to backport this to the 0.7 branch? (It's
very hard to tune CMS, and for people new to Java the task is even harder.)

Re: Flush / Snapshot Triggering Full GCs, Leaving Ring

Posted by Jonathan Ellis <jb...@gmail.com>.
Hypothesis: it's probably the flush causing the CMS, not the snapshot linking.

Confirmation possibility #1: Add a logger.warn to
CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
called, but let's rule it out.
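
Something along these lines -- a sketch only, assuming the 0.7 method shape and the class's slf4j logger, with the exec body standing in for whatever CLibrary actually does on the fallback path. The point is just that with JNA working you should never see the warning:

    import java.io.File;
    import java.io.IOException;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public final class CreateHardLinkWithExecSketch
    {
        private static final Logger logger =
                LoggerFactory.getLogger(CreateHardLinkWithExecSketch.class);

        // Mirrors CLibrary.createHardLinkWithExec; only the warn is the addition.
        public static void createHardLinkWithExec(File sourceFile, File destinationFile)
                throws IOException, InterruptedException
        {
            // With JNA enabled this path should never be taken during a snapshot.
            logger.warn("Hard-linking via exec'd ln: {} -> {}", sourceFile, destinationFile);

            Process p = Runtime.getRuntime().exec(new String[] {
                    "ln", sourceFile.getAbsolutePath(), destinationFile.getAbsolutePath() });
            p.waitFor();
        }
    }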

Confirmation possibility #2: Force some flushes w/o snapshot.

Either way: "concurrent mode failure" is the easy GC problem.
Hopefully you really are seeing mostly that -- this means the JVM
didn't start CMS early enough, so it ran out of space before it could
finish the concurrent collection, so it falls back to stop-the-world.
The fix is a combination of reducing -XX:CMSInitiatingOccupancyFraction
and (possibly) increasing heap capacity if your heap is simply too
full too much of the time.
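
In cassandra-env.sh terms, something like the following -- the 60 is only a starting point to tune against your GC logs, not a recommendation, and 0.7 ships with 75 if I remember right:

    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=60"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"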

You can also mitigate it by increasing the phi threshold for the
failure detector, so the node doing the GC doesn't mark everyone else
as dead.
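
If memory serves the setting is phi_convict_threshold in cassandra.yaml, default 8. Raising it a little makes the failure detector more tolerant of a long pause, at the cost of being slower to notice a genuinely dead node:

    # cassandra.yaml (illustrative value, not a recommendation)
    phi_convict_threshold: 10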

(Eventually your heap will fragment and you will see STW collections
due to "promotion failed," but you should see that much less
frequently. GC tuning to reduce fragmentation may be possible based on
your workload, but that's out of scope here and in any case the "real"
fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com