Posted to commits@cassandra.apache.org by "Robert Stupp (JIRA)" <ji...@apache.org> on 2015/01/06 11:09:36 UTC

[jira] [Comment Edited] (CASSANDRA-7438) Serializing Row cache alternative (Fully off heap)

    [ https://issues.apache.org/jira/browse/CASSANDRA-7438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257721#comment-14257721 ] 

Robert Stupp edited comment on CASSANDRA-7438 at 1/6/15 10:08 AM:
------------------------------------------------------------------

I had the opportunity to test OHC on a big machine.
First: it works - very happy about that :)

Some things I want to note:
* a high number of segments does not have any really measurable influence (the default of 2x the number of cores is fine) - see the sketch after this list for how segment and bucket selection typically works
* throughput heavily depends on serialization (hash entry size) - Java 8 gave about a 10% to 15% improvement in some tests (either in {{Unsafe.copyMemory}} or something related such as the JNI barrier)
* the number of entries per bucket stays pretty low with the default load factor of .75 - the vast majority of buckets have 0 or 1 entries, some have 2 or 3, and a few have up to 8
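
To make the segment and bucket points above concrete, here is a minimal, purely illustrative sketch of how a segmented hash table can derive segment and bucket from a key's hash code - the bit layout and the segment count are assumptions for illustration, not OHC's actual implementation:
{code}
// Illustrative only - not OHC's actual code. Assumes 64 segments (2 * 32 cores)
// and per-segment tables whose sizes are powers of two.
final class SegmentedPlacement
{
    static final int SEGMENT_COUNT = 64;   // assumed: 2 * number of cores
    static final int SEGMENT_SHIFT = Long.numberOfTrailingZeros(SEGMENT_COUNT);

    // the upper hash bits select the segment, so per-segment locks rarely collide
    static int segmentOf(long hash)
    {
        return (int) ((hash >>> (64 - SEGMENT_SHIFT)) & (SEGMENT_COUNT - 1));
    }

    // the lower hash bits select the bucket inside the segment's table
    static int bucketOf(long hash, int tableSize)   // tableSize must be a power of two
    {
        return (int) (hash & (tableSize - 1));
    }
}
{code}
With at least as many segments as cores, contention on the per-segment locks is already small, which matches the observation that raising the segment count further has no measurable effect.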

Issue (not solvable yet):
It works great for hash entries up to approx. 64 kB, with good to great throughput. Above that barrier it works fine at first, but after some time the system spends a huge amount of CPU time (~95%) in {{malloc()}} / {{free()}} (with jemalloc; {{Unsafe.allocateMemory}} is not worth discussing at all on Linux).
I tried to add a "memory buffer cache" that caches freed hash entries for reuse (a sketch follows below), but it turned out that doing it right would be too complex in the end. The current implementation is still in the code but must be explicitly enabled with a system property. Workloads with small entries and a high number of threads easily trigger the Linux OOM killer (which kills the process). Note that it does work with large hash entries - but throughput drops dramatically to just a few thousand writes per second.
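
The "memory buffer cache" idea boils down to keeping freed off-heap blocks around, keyed by a rounded-up size class, instead of returning them to {{free()}}. Here is a minimal sketch of that idea - it is not the implementation that sits behind the system property, class and constant names are made up for illustration, and the unbounded retention shown here is exactly what makes such a cache dangerous with many threads:
{code}
// Illustrative sketch of a per-size-class cache for freed off-heap blocks.
import java.lang.reflect.Field;
import java.util.concurrent.ConcurrentLinkedQueue;
import sun.misc.Unsafe;

final class FreedBlockCache
{
    private static final Unsafe unsafe = loadUnsafe();
    private static final int MIN_SHIFT = 10;   // smallest size class: 1 kB
    private static final int MAX_SHIFT = 20;   // largest cached size class: 1 MB

    // one free list per power-of-two size class; larger blocks go straight to free()
    @SuppressWarnings("unchecked")
    private final ConcurrentLinkedQueue<Long>[] freeLists =
            new ConcurrentLinkedQueue[MAX_SHIFT - MIN_SHIFT + 1];

    FreedBlockCache()
    {
        for (int i = 0; i < freeLists.length; i++)
            freeLists[i] = new ConcurrentLinkedQueue<>();
    }

    long allocate(long bytes)
    {
        int cls = sizeClass(bytes);
        if (cls >= 0)
        {
            Long cached = freeLists[cls].poll();
            if (cached != null)
                return cached;                    // reuse a previously freed block
            bytes = 1L << (cls + MIN_SHIFT);      // round the request up to the class size
        }
        return unsafe.allocateMemory(bytes);
    }

    void free(long address, long bytes)
    {
        int cls = sizeClass(bytes);
        if (cls >= 0)
            freeLists[cls].add(address);          // retained for reuse - unbounded, so RSS only grows
        else
            unsafe.freeMemory(address);
    }

    private static int sizeClass(long bytes)
    {
        for (int shift = MIN_SHIFT; shift <= MAX_SHIFT; shift++)
            if (bytes <= (1L << shift))
                return shift - MIN_SHIFT;
        return -1;
    }

    private static Unsafe loadUnsafe()
    {
        try
        {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }
        catch (Exception e)
        {
            throw new AssertionError(e);
        }
    }
}
{code}
Bounding the per-class lists and releasing memory under pressure is where doing it "right" gets complex.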

Some numbers (value sizes follow a Gaussian distribution). I had to run these tests in a hurry because I had to give the machine back. The code used during these tests is tagged {{0.1-SNAP-Bench}} in git. Throughput is limited by {{malloc()}} / {{free()}}, and most tests only used about 50% of the available CPU capacity (on _c3.8xlarge_ - 32 cores, Intel Xeon E5-2680v2 @ 2.8GHz, 64GB).
* -1k..200k value size, 32 threads, 1M keys, 90% read ratio, 32GB: 22k writes/sec, 200k reads/sec, ~8k evictions/sec, write: 8ms (99perc), read: 3ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 90% read ratio, 32GB: 55k writes/sec, 499k reads/sec, ~2k evictions/sec, write: .1ms (99perc), read: .03ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 50% read ratio, 32GB: 195k writes/sec, 195k reads/sec, ~9k evictions/sec, write: .2ms (99perc), read: .1ms(99perc)-
* -1k..64k value size, 500 threads, 1M keys, 10% read ratio, 32GB: 185k writes/sec, 20k reads/sec, ~7k evictions/sec, write: 4ms (99perc), read: .07ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 90% read ratio, 32GB: 110k writes/sec, 1M reads/sec, 30k evictions/sec, write: .04ms (99perc), read: .01ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 50% read ratio, 32GB: 420k writes/sec, 420k reads/sec, 125k evictions/sec, write: .06ms (99perc), read: .01ms(99perc)-
* -1k..16k value size, 500 threads, 5M keys, 10% read ratio, 32GB: 435k writes/sec, 48k reads/sec, 130k evictions/sec, write: .06ms (99perc), read: .01ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 90% read ratio, 32GB: 140k writes/sec, 1.25M reads/sec, 50k evictions/sec, write: .02ms (99perc), read: .005ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 50% read ratio, 32GB: 530k writes/sec, 530k reads/sec, 220k evictions/sec, write: .04ms (99perc), read: .005ms(99perc)-
* -1k..4k value size, 500 threads, 20M keys, 10% read ratio, 32GB: 665k writes/sec, 74k reads/sec, 250k evictions/sec, write: .04ms (99perc), read: .005ms(99perc)-

Command line to execute the benchmark:
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 'uniform(1..20000000)' -wkd 'uniform(1..20000000)' -vs 'gaussian(1024..4096,2)' -r .1 -cap 32000000000 -d 86400 -t 500 -dr 8

-r = read rate
-d = duration
-t = # of threads
-dr = # of driver threads that feed the worker threads
-rkd = read key distribution
-wkd = write key distribution
-vs = value size
-cap = capacity
{code}
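
As a usage example, the 1M-key, 1k..64k, 90%-read run above would presumably have been started along these lines (the gaussian/uniform arguments are inferred from the command above, so treat the exact syntax as an assumption):
{code}
java -jar ohc-benchmark/target/ohc-benchmark-0.1-SNAPSHOT.jar -rkd 'uniform(1..1000000)' -wkd 'uniform(1..1000000)' -vs 'gaussian(1024..65536,2)' -r .9 -cap 32000000000 -d 86400 -t 500 -dr 8
{code}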

Sample bucket histogram from the 20M-keys test:
{code}
    [0..0]: 8118604
    [1..1]: 5892298
    [2..2]: 2138308
    [3..3]: 518089
    [4..4]: 94441
    [5..5]: 13672
    [6..6]: 1599
    [7..7]: 189
    [8..9]: 16
{code}
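
The histogram counts sum to 16,777,216 (2^24) buckets holding roughly 12.2M live entries, so with a uniform hash the per-bucket occupancy should follow a Poisson distribution with lambda of about 0.73. A small sanity-check sketch (observed counts copied from the histogram above; the last bin [8..9] is treated as 8):
{code}
// Compare the observed bucket histogram against the Poisson distribution
// expected for ~12.2M entries uniformly hashed into 2^24 buckets.
public class BucketHistogramCheck
{
    public static void main(String[] args)
    {
        long[] observed = { 8118604, 5892298, 2138308, 518089, 94441, 13672, 1599, 189, 16 };
        long buckets = 0, entries = 0;
        for (int k = 0; k < observed.length; k++)
        {
            buckets += observed[k];
            entries += (long) k * observed[k];
        }
        double lambda = (double) entries / buckets;   // average entries per bucket (~0.73)
        System.out.printf("buckets=%d entries=%d lambda=%.3f%n", buckets, entries, lambda);

        double p = Math.exp(-lambda);                 // P[X = 0]
        for (int k = 0; k < observed.length; k++)
        {
            System.out.printf("%d entries: observed %,d  expected %,.0f%n", k, observed[k], p * buckets);
            p = p * lambda / (k + 1);                 // P[X = k+1] = P[X = k] * lambda / (k + 1)
        }
    }
}
{code}
The observed counts match the Poisson expectation closely (e.g. ~8.12M empty buckets observed vs. ~8.12M expected), which backs the "vast majority has 0 or 1 entries" observation and shows the hash spreads keys evenly.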

After running into that memory management issue with allocation sizes varying from a few kB to several MB, I think it is still worth working on our own off-heap memory management - maybe some block-based approach (fixed or variable block sizes). But that is out of the scope of this ticket.
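
To illustrate what a fixed-block approach could look like (purely hypothetical, not something this ticket proposes to implement): allocate one large region up front, carve it into equally sized blocks, and recycle blocks through a free list so the hot path never calls {{malloc()}} / {{free()}}.
{code}
// Hypothetical fixed-size block pool: one big allocation up front,
// blocks recycled through a queue, no malloc()/free() on the hot path.
import java.util.concurrent.ConcurrentLinkedQueue;

final class FixedBlockPool
{
    private final sun.misc.Unsafe unsafe;
    private final long base;
    private final ConcurrentLinkedQueue<Long> free = new ConcurrentLinkedQueue<>();

    FixedBlockPool(sun.misc.Unsafe unsafe, long blockSize, int blockCount)
    {
        this.unsafe = unsafe;
        this.base = unsafe.allocateMemory(blockSize * blockCount);
        for (int i = 0; i < blockCount; i++)
            free.add(base + i * blockSize);
    }

    Long acquire()          { return free.poll(); }   // null when the pool is exhausted
    void release(long addr) { free.add(addr); }
    void close()            { unsafe.freeMemory(base); }
}
{code}
Variable-sized hash entries would then need either chaining of blocks or several such pools with different block sizes, which is exactly the part that goes beyond this ticket.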

EDIT: The problem with high system-CPU usage only occurs on systems with multiple CPU sockets. A cross check with the second CPU socket disabled - running the benchmark with {{taskset 0x3ff java -jar ...}} - does not show the 95% system CPU usage.

EDIT2: Marked benchmark values as invalid (see my comment on 01/06/15)


> Serializing Row cache alternative (Fully off heap)
> --------------------------------------------------
>
>                 Key: CASSANDRA-7438
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7438
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Linux
>            Reporter: Vijay
>            Assignee: Vijay
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: 0001-CASSANDRA-7438.patch, tests.zip
>
>
> Currently SerializingCache is only partially off heap; keys are still stored in the JVM heap as ByteBuffers.
> * There is a higher GC cost for a reasonably big cache.
> * Some users have used the row cache efficiently in production for better results, but this requires careful tuning.
> * The memory overhead for the cache entries is relatively high.
> So the proposal for this ticket is to move the LRU cache logic completely off heap and use JNI to interact with the cache. We might want to ensure that the new implementation matches the existing APIs (ICache), and the implementation needs to have safe memory access, low memory overhead and as few memcpys as possible.
> We might also want to make this cache configurable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)