Posted to user@cassandra.apache.org by Radim Kolar <hs...@sendmail.cz> on 2011/12/26 23:01:55 UTC

better anti OOM

If a node is low on memory (0.95+ of the heap used) it could:

1. stop repair
2. stop largest compaction
3. reduce number of compaction slots
4. switch compaction to single threaded

Flushing the largest memtable / reducing the caches is not enough.
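
A sketch of how 3. and 4. can be approximated by hand today via
cassandra.yaml; the values are examples only and the knobs are assumptions
about the deployed 1.0.x version (a restart is needed):

  # In cassandra.yaml:
  #   concurrent_compactors: 1             <- fewer compaction slots
  #   multithreaded_compaction: false      <- single-threaded compaction
  #   compaction_throughput_mb_per_sec: 8  <- throttle what remains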

Re: better anti OOM

Posted by Edward Capriolo <ed...@gmail.com>.
I do major compactions and I have run into bloom filters causing OOM. One
trick I used was lowering the size of the row/key caches with nodetool
before triggering the compaction and raising them again after the
compaction finished. As suggested, running with spare heap is a very good
idea; it lowers the chance of a stop-the-world garbage collection. Also,
less free space usually means more memory fragmentation, which makes your
system work the CPU harder.
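
A minimal sketch of that trick with nodetool; the host, keyspace, column
family and cache sizes below are placeholders, not Edward's actual values:

  nodetool -h localhost setcachecapacity MyKeyspace MyCF 10000 0       # shrink key/row caches
  nodetool -h localhost compact MyKeyspace MyCF                        # run the major compaction
  nodetool -h localhost setcachecapacity MyKeyspace MyCF 200000 100000 # restore them afterwards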

It is counter-intuitive to leave "free memory" because you want to use it
for large caches etc., but the overhead gives more stability, which in the
end gives better performance.

On Tuesday, December 27, 2011, Peter Schuller <pe...@infidyne.com>
wrote:
>>> In general, don't expect to be able to run at close to heap capacity;
>>> there *will* be spikes.
>> I try to tune for 80% of the heap.
>
> Just FYI, at an 80% target heap usage my guess is you're likely to see
> fallbacks to full compacting GCs. If you are doing analytics only and
> aren't latency critical, that's probably fine though.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
>

Re: better anti OOM

Posted by Peter Schuller <pe...@infidyne.com>.
>> In general, don't expect to be able to run at close to heap capacity;
>> there *will* be spikes.
> I try to tune for 80% of the heap.

Just FYI, at an 80% target heap usage my guess is you're likely to see
fallbacks to full compacting GCs. If you are doing analytics only and
aren't latency critical, that's probably fine though.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: better anti OOM

Posted by Radim Kolar <hs...@sendmail.cz>.
 > How large are the bloom filters in total? I.e., the sizes of the 
*-Filter.db files.
On a moderate node, about 6.5 GB; index sampling will be about 4 GB; the 
heap is 12 GB.
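
Rough arithmetic with those numbers shows how little head-room that heap
has before memtables, caches and compaction spikes are even counted:

  #   bloom filters   ~6.5 GB
  #   index samples   ~4.0 GB
  #   -----------------------
  #   steady live    ~10.5 GB of a 12 GB heap  (~88% used already)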

 > In general, don't expect to be able to run at close to heap capacity; 
there *will* be spikes.
I try to tune for 80% of the heap.

Re: better anti OOM

Posted by Peter Schuller <pe...@infidyne.com>.
> I will investigate the situation more closely using GC via jconsole, but
> isn't the bloom filter for the new sstable entirely in memory? On disk
> there are only 2 files, Index and Data.
> -rw-r--r--  1 root  wheel   1388969984 Dec 27 09:25
> sipdb-tmp-hc-4634-Index.db
> -rw-r--r--  1 root  wheel  10965221376 Dec 27 09:25
> sipdb-tmp-hc-4634-Data.db
>
> The bloom filter can be that big. In my experience, if I trigger a major
> compaction on a 180 GB CF (Compacted row mean size: 130) it will OOM the
> node after 10 seconds, so I am sure that compaction eats memory pretty well.

Yes, you're right; you'll definitely spike in memory usage by whatever
amount corresponds to index sampling/BF for the thing being compacted.
This can be mitigated by never running full compactions (i.e., not
running 'nodetool compact'), but it won't go away altogether.

Also, if your version doesn't yet have
https://issues.apache.org/jira/browse/CASSANDRA-2466 applied, another
side-effect is that the sudden large allocations for bloom filters can
cause promotion failures even if there is free memory.

> Yes, it prints messages like "heap is almost full" and after some time it
> usually OOMs during a large compaction.

Ok, in that case it seems even more clear that you simply need a
larger heap. How large are the bloom filters in total? I.e., the sizes of
the *-Filter.db files. In general, don't expect to be able to run at
close to heap capacity; there *will* be spikes.
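
To total up the *-Filter.db files quickly, a sketch assuming the default
data directory (adjust the path to your install):

  du -ch /var/lib/cassandra/data/*/*-Filter.db | tail -n 1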

In this particular case, leveled compaction in 1.0 should mitigate the
effect quite significantly since it only compacts up to 10% of the
data set at a time so memory usage should be considerably more even
(as will disk space usage be). That would allow you to run a bit
closer to heap capacity than regular compaction.
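
A sketch of switching a CF to leveled compaction from cassandra-cli; the
keyspace/CF names are placeholders and the exact syntax may vary slightly
between 1.0.x releases:

  echo "use MyKeyspace;" > leveled.cli
  echo "update column family MyCF with compaction_strategy = 'LeveledCompactionStrategy';" >> leveled.cli
  cassandra-cli -h localhost -f leveled.cli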

Also, consider tweaking compaction throughput settings to control the
rate of allocation generated during a compaction, even if you don't
need it for disk I/O purposes.
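
For example (the throughput value is only an illustration, and the
nodetool subcommand is an assumption about what your 1.0.x build ships):

  # persistent setting in cassandra.yaml:
  #   compaction_throughput_mb_per_sec: 8
  # or at runtime, if your nodetool has it:
  nodetool -h localhost setcompactionthroughput 8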

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: better anti OOM

Posted by Radim Kolar <hs...@sendmail.cz>.
> I don't know what you are basing that on. It seems unlikely to me that
> the working set of a compaction is 600 MB. However, it may very well
> be that the allocation rate is such that it contributes to an
> additional 600 MB average heap usage after a CMS phase has completed.
I will investigate the situation more closely using GC via jconsole, but 
isn't the bloom filter for the new sstable entirely in memory? On disk 
there are only 2 files, Index and Data.
-rw-r--r--  1 root  wheel   1388969984 Dec 27 09:25 
sipdb-tmp-hc-4634-Index.db
-rw-r--r--  1 root  wheel  10965221376 Dec 27 09:25 
sipdb-tmp-hc-4634-Data.db

The bloom filter can be that big. In my experience, if I trigger a major 
compaction on a 180 GB CF (Compacted row mean size: 130) it will OOM the 
node after 10 seconds, so I am sure that compaction eats memory pretty well.

 > Also, you say it's "pretty dead". What exactly does that mean? Does 
it OOM?
Yes, it prints messages like "heap is almost full" and after some time it 
usually OOMs during a large compaction.

The easiest fix is probably to increase the heap size. I know this 
e-mail doesn't begin to explain details but it's such a long story.

Actually, there is a lack of decent documentation about Cassandra memory 
and GC tuning.
DataStax recommends this: (memtable_total_space_in_mb) + 1 GB + 
(key_cache_size_estimate), which will work only for small tables.
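
A rough worked example of why that formula undershoots here; the memtable
and key-cache figures are made-up illustrative values, while the bloom
filter and index sample sizes are the ones quoted earlier in this thread:

  #   formula:   2 GB memtables + 1 GB + 1 GB key cache           = ~4 GB heap
  #   resident:  6.5 GB bloom filters + 4 GB index samples + ...   > 10 GB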

Re: better anti OOM

Posted by Peter Schuller <pe...@infidyne.com>.
> I suggest you describe exactly what the problem is you have and why you
> think stopping compaction/repair is the appropriate solution.
>
> compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to the
> heap; the node logs messages like:

I don't know what you are basing that on. It seems unlikely to me that
the working set of a compaction is 600 MB. However, it may very well
be that the allocation rate is such that it contributes to an
additional 600 MB average heap usage after a CMS phase has completed.

> After node boot
> Heap Memory (MB) : 1157.98 / 1985.00
>
> disabled gossip + thrift, only compaction running
> Heap Memory (MB) : 1981.00 / 1985.00

Using "nodetool info" to monitor heap usage is not really useful
unless done continuously over time and observing the free heap after
CMS phases have completed. Regardless, the heap is always expected to
grow in usage to the occupancy trigger which kick-starts CMS. That
said, 1981/1985 does indicate a non-desirable state for Cassandra, but
it does not mean that compaction is "using" 600 mb as such (in terms
of live set). You might say that it implies >= 600 mb extra heap
required at your current heap size and GC settings.

If you want to understand what's happening, I suggest attaching with
visualvm/jconsole and looking at the GC behavior, and running with
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps. When attached with visualvm/jconsole you can
hit "perform gc" and see how far it drops, to judge what the actual
live set is.
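
For reference, a sketch of enabling that logging in conf/cassandra-env.sh
(the log path is just an example):

  JVM_OPTS="$JVM_OPTS -XX:+PrintGC -XX:+PrintGCDetails"
  JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps"
  JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"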

Also, you say it's "pretty dead". What exactly does that mean? Does it
OOM? I suspect you're just seeing fallbacks to full GC and long pauses
because you're allocating and promoting to old-gen fast enough that
CMS is just not keeping up, rather than it having to do with memory
"use" per se.

In your case, I suspect you simply need to run with a bigger heap or
reconfigure CMS to use additional threads for concurrent marking
(-XX:ParallelCMSThreads=XXX - try XXX = number of CPU cores for
example in this case). Alternatively, a larger young gen to avoid so
much getting promoted during compaction.
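
A sketch of what that might look like in conf/cassandra-env.sh; the sizes
and thread count are examples for a hypothetical 16 GB / 8-core node, not
recommendations for this cluster:

  MAX_HEAP_SIZE="8G"
  HEAP_NEWSIZE="800M"        # larger young gen
  JVM_OPTS="$JVM_OPTS -XX:ParallelCMSThreads=8"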

But really, in short: The easiest fix is probably to increase the heap
size. I know this e-mail doesn't begin to explain details but it's
such a long story.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: better anti OOM

Posted by Radim Kolar <hs...@sendmail.cz>.
I suggest you describe exactly what the problem is you have and why you 
think stopping compaction/repair is the appropriate solution.

compacting a 41.7 GB CF with about 200 million rows adds ~600 MB to the 
heap; the node logs messages like:

  WARN [ScheduledTasks:1] 2011-12-27 00:20:57,972 GCInspector.java (line 
146) Heap is 0.9712791382159731 full.  You may need to reduce memtable 
and/or cache sizes.  Cassandra will now flush up to the two largest 
memtables to free up memory.  Adjust flush_largest_memtables_at 
threshold in cassandra.yaml if you don't want Cassandra to do this 
automatically
  INFO [ScheduledTasks:1] 2011-12-27 00:21:12,362 StorageService.java 
(line 2608) Unable to reduce heap usage since there are no dirty column 
families

And it's pretty dead; killing the compaction will make it alive again.

After node boot
Heap Memory (MB) : 1157.98 / 1985.00

disabled gossip + thrift, only compaction running
Heap Memory (MB) : 1981.00 / 1985.00

Re: better anti OOM

Posted by Peter Schuller <pe...@infidyne.com>.
> If a node is low on memory (0.95+ of the heap used) it could:
>
> 1. stop repair
> 2. stop largest compaction
> 3. reduce number of compaction slots
> 4. switch compaction to single threaded
>
> Flushing the largest memtable / reducing the caches is not enough.

Note that the "emergency" flushing is just a stop-gap. You should run
with appropriately sized heaps under normal conditions; the emergency
flushing stuff is intended to mitigate the effects of having too small a
heap size; it is not expected to avoid the detrimental effects
completely.
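
The thresholds involved live in cassandra.yaml; the values below are the
usual 1.0-era defaults, so check your own file:

  #   flush_largest_memtables_at: 0.75
  #   reduce_cache_sizes_at: 0.85
  #   reduce_cache_capacity_to: 0.6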

Also note that things like compaction do not normally contribute
significantly to the live size on your heap, but they typically do
contribute to the allocation rate, which can cause promotion failures or
concurrent mode failures if your heap size is too small and/or your
concurrent mark/sweep settings are not aggressive enough. Aborting
compaction wouldn't really help with anything other than avoiding a
fallback to full GC in the short term.

I suggest you describe exactly what the problem is you have and why
you think stopping compaction/repair is the appropriate solution.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)