Posted to dev@kafka.apache.org by "Bill Bejeck (JIRA)" <ji...@apache.org> on 2016/07/25 01:17:20 UTC

[jira] [Commented] (KAFKA-3973) Investigate feasibility of caching bytes vs. records

    [ https://issues.apache.org/jira/browse/KAFKA-3973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391229#comment-15391229 ] 

Bill Bejeck commented on KAFKA-3973:
------------------------------------

The results of the LRU investigation are below. Three types of measurements were taken:

1. The current cache, tracking its maximum size by record count (Control).
2. A cache tracking size by maximum memory (Object). Both keys and values counted toward total memory. The maximum size for this cache was calculated by multiplying the memory of a key/value pair (measured with the MemoryMeter class from the jamm library, https://github.com/jbellis/jamm) by the maximum size specified for the Control/Bytes caches.
3. A cache storing serialized bytes (Bytes). The maximum size in this case was by record count. Both keys and values were serialized/deserialized.

I have attached the benchmarking class and the modified MemoryLRUCache class for reference.
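For reference, the count-bounded (Control) approach can be sketched with a plain JDK LinkedHashMap in access order. This is only an illustrative stand-in, not the attached MemoryLRUCache itself; the class name and max-entries parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a count-bounded LRU cache: a LinkedHashMap in
// access order that evicts the least recently used entry once more than
// maxEntries records are cached. Eviction is by record count, not memory.
public class CountBoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public CountBoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict when over the record-count budget
    }

    public static void main(String[] args) {
        CountBoundedLruCache<String, String> cache = new CountBoundedLruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds maxEntries, evicting "b"
        System.out.println(cache.containsKey("b")); // prints false
    }
}
```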


While complete accuracy in Java benchmarking is difficult to achieve, these results are sufficient for comparing how the different approaches perform relative to each other.

The cache was set to a maximum size of 500,000 records (or, for the memory-based cache, 500,000 * key/value memory size). Two rounds of 25 iterations each were run. In the first round, 500,000 put/get combinations were performed to measure behaviour when all records fit in the cache. The second round used 1,000,000 put/get combinations to measure performance with evictions. Some benchmarks of raw serialization and memory tracking are included as well.
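A rough sketch of the timing loop described above (the class and method names here are hypothetical, and the actual attached benchmark may differ): each iteration times a batch of put/get pairs and the per-iteration average is reported in milliseconds.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical put/get timing harness: run `iterations` rounds of `ops`
// put/get pairs against a freshly built cache and return the average
// wall-clock time per round in milliseconds.
public class PutGetTimer {

    static double averageMillis(Supplier<Map<Integer, String>> cacheFactory,
                                int ops, int iterations) {
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            Map<Integer, String> cache = cacheFactory.get();
            long start = System.nanoTime();
            for (int k = 0; k < ops; k++) {
                cache.put(k, "value-" + k);
                cache.get(k);
            }
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / iterations;
    }

    public static void main(String[] args) {
        // Small numbers here for illustration; the runs above used
        // 500K/1M put/get combinations over 25 iterations.
        double avg = averageMillis(HashMap::new, 10_000, 5);
        System.out.println("ave time (millis) " + avg);
    }
}
```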

As expected, the Control group had the best performance.  The Object (memory tracking) approach was faster than serialization only when the MemoryMeter.measure method was used.  However, MemoryMeter.measure only captures the memory taken by the object itself; it does not account for any other objects in the object graph.  For example, here is a debug statement showing the memory for the string "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a porttitor felis. In vel dolor."


MemoryMeter.measure :
24

MemoryMeter.measureDeep :
root [java.lang.String] 232 bytes (24 bytes)
  |
  +--value [char[]] 208 bytes (208 bytes)

232

MemoryMeter.measure completely ignores the char array hanging off String objects.  With this in mind, we would be forced to use MemoryMeter.measureDeep to get an accurate measure of the objects placed in the cache.  As the results below show, MemoryMeter.measureDeep had the slowest performance.

With these results in mind, storing bytes in the cache looks like the best approach going forward.
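To make the bytes-based variant concrete, here is a minimal sketch, not the attached implementation: keys and values are serialized up front (UTF-8 strings stand in for real serdes here), and eviction is driven by the total number of cached bytes rather than the record count. All names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of a bytes-bounded LRU cache: entries are stored
// as serialized bytes and evicted least-recently-used-first once the
// total byte budget is exceeded.
public class BytesBoundedLruCache {
    private final long maxBytes;
    private long currentBytes = 0;
    private final LinkedHashMap<String, byte[]> map =
            new LinkedHashMap<>(16, 0.75f, true); // access order for LRU

    public BytesBoundedLruCache(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    public void put(String key, String value) {
        byte[] valueBytes = value.getBytes(StandardCharsets.UTF_8); // "serialize"
        byte[] old = map.put(key, valueBytes);
        if (old != null) {
            currentBytes -= old.length; // key already accounted for
        } else {
            currentBytes += key.getBytes(StandardCharsets.UTF_8).length;
        }
        currentBytes += valueBytes.length;
        evictUntilWithinBudget();
    }

    public String get(String key) {
        byte[] bytes = map.get(key);
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8); // "deserialize"
    }

    private void evictUntilWithinBudget() {
        Iterator<Map.Entry<String, byte[]>> it = map.entrySet().iterator();
        while (currentBytes > maxBytes && it.hasNext()) {
            Map.Entry<String, byte[]> eldest = it.next();
            currentBytes -= eldest.getKey().getBytes(StandardCharsets.UTF_8).length
                    + eldest.getValue().length;
            it.remove();
        }
    }

    public static void main(String[] args) {
        BytesBoundedLruCache cache = new BytesBoundedLruCache(7);
        cache.put("a", "11"); // 1 key byte + 2 value bytes = 3 total
        cache.put("b", "22"); // 6 total
        cache.get("a");       // touch "a" so "b" becomes the eldest entry
        cache.put("c", "33"); // 9 > 7, evicts "b"
        System.out.println(cache.get("b")); // prints null
    }
}
```

Note that every get pays a deserialization cost, which matches the Bytes numbers above: competitive with memory tracking when everything fits in the cache, and slower once evictions start.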

Final notes
1. Another tool, Java Object Layout (http://openjdk.java.net/projects/code-tools/jol/), shows promise but needs evaluation.
2. These benchmarks should be rewritten with JMH (http://openjdk.java.net/projects/code-tools/jmh/).  Using JMH requires at least a separate module, but the JMH Gradle plugin (https://github.com/melix/jmh-gradle-plugin) looks interesting, as it makes it possible to integrate JMH benchmarks into an existing project.  Having a place to write/run JMH benchmarks could benefit the project as a whole.  If this seems worthwhile, I will create a Jira ticket and look into adding the JMH plugin or creating a separate benchmarking module.
3. We should probably add a benchmark exercising the MemoryLRUCache itself as well.

Investigation Results
Tests for 500,000 inserts; max cache size 500K records (or 500K * key/value memory for the Object cache)

Control       500K cache put/get results 25 iterations ave time (millis) 53.24
Object        500K cache put/get results 25 iterations ave time (millis) 250.88
Object(Deep)  500K cache put/get results 25 iterations ave time (millis) 1720.08
Bytes         500K cache put/get results 25 iterations ave time (millis) 288.92

Tests for 1,000,000 inserts; max cache size 500K records (or 500K * key/value memory for the Object cache)

Control       1M cache put/get results 25 iterations ave time (millis) 227.48
Object        1M cache put/get results 25 iterations ave time (millis) 488.2
Object(Deep)  1M cache put/get results 25 iterations ave time (millis) 2575.04
Bytes         1M cache put/get results 25 iterations ave time (millis) 852.04

Raw timing of tracking memory (deep) for 500K Strings
Took [567] millis to track memory

Raw timing of tracking memory for 500K Strings
Took [92] millis to track memory

Raw timing of tracking memory (deep) for 500K ComplexObjects
Took [2813] millis to track memory

Raw timing of tracking memory for 500K ComplexObjects
Took [148] millis to track memory

Raw timing of serialization for 500K Strings
Took [133] millis to serialize

Raw timing of serialization for 500K ComplexObjects
Took [525] millis to serialize






> Investigate feasibility of caching bytes vs. records
> ----------------------------------------------------
>
>                 Key: KAFKA-3973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3973
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: streams
>            Reporter: Eno Thereska
>            Assignee: Bill Bejeck
>             Fix For: 0.10.1.0
>
>         Attachments: CachingPerformanceBenchmarks.java, MemoryLRUCache.java
>
>
> Currently the cache stores and accounts for records, not bytes or objects. This investigation would be around measuring any performance overheads that come from storing bytes or objects. As an outcome we should know whether 1) we should store bytes or 2) we should store objects. 
> If we store objects, the cache still needs to know their size (so that it can know if the object fits in the allocated cache space, e.g., if the cache is 100MB and the object is 10MB, we'd have space for 10 such objects). The investigation needs to figure out how to find out the size of the object efficiently in Java.
> If we store bytes, then we are serialising an object into bytes before caching it, i.e., we take a serialisation cost. The investigation needs measure how bad this cost can be especially for the case when all objects fit in cache (and thus any extra serialisation cost would show).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)