Posted to commits@samza.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2014/10/02 18:46:34 UTC

[jira] [Commented] (SAMZA-428) Investigate: how to tune down caching in the KeyValueStore implementations

    [ https://issues.apache.org/jira/browse/SAMZA-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156731#comment-14156731 ] 

Jay Kreps commented on SAMZA-428:
---------------------------------

Let me give the rationale here. 

I agree that tuning caching in the setup we have is quite complex because there are effectively three levels:
1. Our in-heap row cache
2. LevelDB/RocksDB uncompressed block cache
3. LevelDB/RocksDB compressed blocks cached in the filesystem

How best to allocate memory among these three levels is pretty workload-specific.

The row cache (a) avoids serialization overhead, (b) avoids writes to Kafka and disk I/O entirely, and (c) is extremely wasteful of memory. The memory waste is worth considering: because of the number of Java objects that end up cached, it is very unlikely you can get to more than 30% useful data, with the rest going to object, heap, and data structure overhead. So for big chunks of memory I suspect the filesystem or RocksDB cache is better.

So why have an in-process cache at all? The rationale was that there are actually lots of simple cases that can be vastly improved with even a very small in-process cache. These are cases where you are incrementing a small number of counters over and over again. Logging each change is very expensive, and the serialization overhead is really high since each increment requires deserialization and reserialization. By defaulting to just a small in-process cache, I think we can make the case of a small data set pretty efficient out of the box at the cost of just a little bit of memory.
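
To make the counter case concrete, here is a rough Python sketch of the effect. It is not Samza's CachedStore; every name in it (CountingStore, SmallWriteBackCache, compare_log_writes) is hypothetical, and JSON stands in for whatever serde a real job would use. It only illustrates how a tiny write-back object cache can collapse repeated increments of a few keys into one logged write per key.

```python
import json
from collections import OrderedDict


class CountingStore:
    """Hypothetical stand-in for a serialized, change-logged key-value store."""

    def __init__(self):
        self.data = {}            # key -> serialized bytes
        self.serializations = 0   # serialize + deserialize calls
        self.log_writes = 0       # simulated changelog writes

    def get(self, key):
        raw = self.data.get(key)
        if raw is None:
            return None
        self.serializations += 1
        return json.loads(raw)

    def put(self, key, value):
        self.serializations += 1
        self.log_writes += 1
        self.data[key] = json.dumps(value).encode("utf-8")


class SmallWriteBackCache:
    """Tiny LRU object cache that defers writes until eviction or flush."""

    def __init__(self, store, capacity):
        self.store = store
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> (value, dirty)

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key][0]
        value = self.store.get(key)
        self._insert(key, value, dirty=False)
        return value

    def put(self, key, value):
        self._insert(key, value, dirty=True)

    def _insert(self, key, value, dirty):
        self.entries[key] = (value, dirty)
        self.entries.move_to_end(key)
        # Evict least recently used entries, writing back only dirty ones.
        while len(self.entries) > self.capacity:
            old_key, (old_value, old_dirty) = self.entries.popitem(last=False)
            if old_dirty:
                self.store.put(old_key, old_value)

    def flush(self):
        for key, (value, dirty) in list(self.entries.items()):
            if dirty:
                self.store.put(key, value)
                self.entries[key] = (value, False)


def run_increments(store_like, keys, rounds):
    """Increment each counter once per round through the given store."""
    for _ in range(rounds):
        for key in keys:
            current = store_like.get(key) or 0
            store_like.put(key, current + 1)


def compare_log_writes(keys, rounds, capacity=16):
    """Return (uncached, cached) simulated log-write counts."""
    direct = CountingStore()
    run_increments(direct, keys, rounds)

    backing = CountingStore()
    cache = SmallWriteBackCache(backing, capacity)
    run_increments(cache, keys, rounds)
    cache.flush()
    return direct.log_writes, backing.log_writes
```

In this toy setup, 1,000 rounds over three counters cost 3,000 simulated log writes without the cache versus 3 with it after a final flush. The capacity of 16 is an arbitrary illustrative value, not a recommended or default setting.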

> Investigate: how to tune down caching in the KeyValueStore implementations
> --------------------------------------------------------------------------
>
>                 Key: SAMZA-428
>                 URL: https://issues.apache.org/jira/browse/SAMZA-428
>             Project: Samza
>          Issue Type: Improvement
>          Components: kv
>    Affects Versions: 0.8.0
>            Reporter: Chinmay Soman
>             Fix For: 0.8.0
>
>
> Currently, we have a 'CachedStore' layer on top of the KeyValueStore implementation that we use. This might lead to double caching:
> i) Once at the CachedStore layer
> ii) Possibly cached again in the specific K-V store that we use (e.g., RocksDB / BDB)
> We need the CachedStore layer so that the writes to LoggedStore (if configured) are done in an efficient manner. 
> We can then potentially do some config tuning for the K-V store to reduce its memory footprint and simply write to disk. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)