Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2011/03/24 20:24:05 UTC

[jira] [Updated] (CASSANDRA-2006) Serverwide caps on memtable thresholds

     [ https://issues.apache.org/jira/browse/CASSANDRA-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2006:
--------------------------------------

    Attachment: 2006.txt

Patch that optionally creates a global heap usage threshold and tries to keep total memtable size under that.

The two main points of interest are Memtable.updateLiveRatio and MeteredFlusher.

MeteredFlusher is what checks memory usage (once per second) and kicks off the flushes.  Note that naively flushing when we hit the threshold is wrong, since you can have multiple memtables in-flight during the flush process.  To address this, we track inactive but unflushed memtables and include those in our total. We also aggressively flush any memtable that reaches the level of "if my entire flush pipeline were full of memtables of this size, how big could I allow them to be."
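
A minimal sketch of that once-per-second loop, under assumed names (this is illustrative, not the patch code):

{noformat}
// Hypothetical sketch of the check described above; the interface and all
// names here are assumptions for illustration, not the actual patch code.
import java.util.List;

interface MemtableStats
{
    long liveSize();          // estimated heap bytes used by the active memtable
    long pendingFlushSize();  // heap still held by memtables queued or being flushed
    void forceFlush();
}

class MeteredFlusherSketch implements Runnable
{
    private final long globalThreshold; // server-wide cap, in bytes
    private final int pipelineDepth;    // how many memtables can be in flight at once
    private final List<MemtableStats> columnFamilies;

    MeteredFlusherSketch(long globalThreshold, int pipelineDepth, List<MemtableStats> cfs)
    {
        this.globalThreshold = globalThreshold;
        this.pipelineDepth = pipelineDepth;
        this.columnFamilies = cfs;
    }

    public void run() // scheduled once per second
    {
        long total = 0;
        MemtableStats largest = null;
        for (MemtableStats cf : columnFamilies)
        {
            // count inactive-but-unflushed memtables too: they still occupy heap
            total += cf.liveSize() + cf.pendingFlushSize();

            // aggressively flush any memtable big enough that a full pipeline of
            // memtables this size would exceed the global cap
            if (cf.liveSize() > globalThreshold / pipelineDepth)
                cf.forceFlush();

            if (largest == null || cf.liveSize() > largest.liveSize())
                largest = cf;
        }

        // over the cap overall: flush the biggest consumer
        if (total > globalThreshold && largest != null)
            largest.forceFlush();
    }
}
{noformat}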

Since counting each object's size is far too slow to be useful directly, we compute the ratio of serialized size to memory size in the background and update it periodically; that is what updateLiveRatio does.  MeteredFlusher then bases its work on the actual serialized size, multiplied by this ratio.
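
A sketch of the idea (the names here are illustrative; the real method lives on Memtable):

{noformat}
// Hypothetical sketch of the live-ratio estimate; names are illustrative.
class LiveRatioSketch
{
    // ratio of actual heap size to serialized size; a conservative default
    // until the background measurement has run at least once
    private volatile double liveRatio = 10.0;

    // run occasionally on a background thread (never on the write path):
    // measure the memtable's real heap footprint and recompute the ratio
    void updateLiveRatio(long measuredHeapBytes, long serializedBytes)
    {
        if (serializedBytes > 0)
            liveRatio = (double) measuredHeapBytes / serializedBytes;
    }

    // what MeteredFlusher uses: the cheap serialized-byte counter times the ratio
    long estimatedLiveSize(long serializedBytes)
    {
        return (long) (serializedBytes * liveRatio);
    }
}
{noformat}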

One last note: the config code is a little messy because we want to leave behavior unchanged (i.e., only use the old per-CF thresholds) if the setting is absent, as it would be for an upgrader. But we also want a setting that means "pick a reasonable default based on heap usage"; hence the distinction between null (off) and -1 (autocompute).
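
Sketched for illustration (the field name and the autocompute heuristic are assumptions, not the patch's actual values):

{noformat}
// Hypothetical sketch of the null-vs-minus-one distinction; the field name
// and the fraction-of-heap heuristic are assumptions for illustration.
class GlobalMemtableConfigSketch
{
    // null  -> setting absent (e.g. an upgraded config): feature off, only the
    //          old per-CF thresholds apply
    // -1    -> autocompute: pick a reasonable default from the heap size
    // other -> explicit cap in MB
    Integer totalMemtableSpaceInMB;

    long effectiveThresholdBytes(long maxHeapBytes)
    {
        if (totalMemtableSpaceInMB == null)
            return Long.MAX_VALUE;                 // disabled
        if (totalMemtableSpaceInMB == -1)
            return maxHeapBytes / 3;               // assumed "reasonable default"
        return totalMemtableSpaceInMB * 1024L * 1024L;
    }
}
{noformat}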

I tested by creating the stress schema, then modifying the per-CF settings to be multiple TB, so only the new global flusher affects things.  Then I created half a GB of commitlog files to replay -- CL replay hammers it much harder than even stress.java.

It was successful in preventing OOM (or even the "emergency flushing" at 85% of heap), but heap usage as reported by CMS was consistently about 25% higher than what MeteredFlusher thought it should be. It may be that we can apply a fudge factor for this; otherwise, tuning by watching CMS vs the estimated size and adjusting the setting manually to compensate is still much easier than the status quo of per-CF tuning.

To experiment, I recommend also patching the log4j settings as follows:

{noformat}
Index: conf/log4j-server.properties
===================================================================
--- conf/log4j-server.properties	(revision 1085010)
+++ conf/log4j-server.properties	(working copy)
@@ -35,7 +35,8 @@
 log4j.appender.R.File=/var/log/cassandra/system.log
 
 # Application logging options
-#log4j.logger.org.apache.cassandra=DEBUG
+log4j.logger.org.apache.cassandra.service.GCInspector=DEBUG
+log4j.logger.org.apache.cassandra.db.MeteredFlusher=DEBUG
 #log4j.logger.org.apache.cassandra.db=DEBUG
 #log4j.logger.org.apache.cassandra.service.StorageProxy=DEBUG
{noformat}


> Serverwide caps on memtable thresholds
> --------------------------------------
>
>                 Key: CASSANDRA-2006
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2006
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>             Fix For: 0.8
>
>         Attachments: 2006.txt
>
>
> By storing global operation and throughput thresholds, we could eliminate the "many small memtables" problem caused by having many CFs. The global threshold would be set in the config file, to allow different classes of servers to have different values configured.
> Operations occurring in the memtable would add to the global counters, in addition to the memtable-local counters. When a global threshold was violated, the memtable in the system that was using the largest fraction of its local threshold would be flushed. Local thresholds would continue to act as they always have.
> The result would be larger sstables, safer operation with multiple CFs, and per-node tuning.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira