Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/09/04 21:29:45 UTC

[jira] [Commented] (FLINK-1320) Add an off-heap variant of the managed memory

    [ https://issues.apache.org/jira/browse/FLINK-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731294#comment-14731294 ] 

ASF GitHub Bot commented on FLINK-1320:
---------------------------------------

GitHub user StephanEwen opened a pull request:

    https://github.com/apache/flink/pull/1093

    [FLINK-1320] [core] Add an off-heap variant of the managed memory

    This pull request extends the Flink managed memory to work transparently on-heap and off-heap.
    
    Flink's core has always used memory management techniques that operate on serialized data (rather than objects) in pre-reserved memory segments on the heap. This has the huge benefit that memory usage can be controlled exactly, and that the Garbage Collector is "tamed" by not accumulating billions of long-lived objects on the heap.
    
    One could say that Flink was using off-heap style techniques with on-heap memory. With this pull request, the system can use memory outside the heap as an alternative to heap memory. The benefit is that this takes the managed memory out of the GC's responsibility and vastly reduces the overall heap size (since often 75% of the memory is used as Flink's managed memory, for sorting, hashing, caching, ...).
    
    That way, Flink processes can be scaled to 100s of gigabytes of main memory.
    
    
    ## Usage
    
    The pull request introduces one simple config flag: `taskmanager.memory.off-heap: true`
    
      - In heap mode, the TaskManager takes 0.7 of the free heap for the Memory Manager (same as before)
      - In off-heap mode, the TaskManager takes 3x the size of the heap for the Memory Manager. This implies that the startup scripts should scale the heap size down to 1/4th, since all big system memory consumers (network, memory manager) now allocate outside the heap.
      - One can control the off-heap memory size via `taskmanager.memory.size` and `taskmanager.memory.off-heap-ratio`.
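
    For illustration, here is a hedged sketch of how these settings might look in `flink-conf.yaml` (the values are placeholders, not recommendations; the exact size/ratio semantics are as described above and in the open issues below):

    ```
    # enable off-heap mode for the managed memory
    taskmanager.memory.off-heap: true

    # amount of managed memory (placeholder value)
    taskmanager.memory.size: 8192

    # ratio used to derive the off-heap memory size (placeholder value)
    taskmanager.memory.off-heap-ratio: 0.75
    ```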
    
    
    ## Design decisions
    
    1. Since the operations on Flink's memory segments are in the innermost loops, they are super performance critical. All code is written to be very JIT friendly.
    
    2. Even though the bulk of the memory may be off-heap (if that mode is chosen), there are various points where short-lived unpooled managed heap memory is needed, so in the off-heap case, we need to support mixed memory (on- and off-heap simultaneously).
    
    3. While benchmarking various possible implementations, I found that many operations can be written to work transparently on on-heap and off-heap memory. The only notable exception is individual byte accesses, which are faster on specialized heap implementations (up to 50% faster, see benchmarks below).
    
    4. Because individual byte operations occur a lot for Strings and tags/headers, I did not want to compromise on their performance.
    
      We now have the following MemorySegment implementations:
    
        - `MemorySegment`: Implements all multi-byte type accesses (ints, longs, doubles) and copy/compare/swap methods to transparently work on heap/off-heap memory.
      
        - `HeapMemorySegment`: Optimized single-byte operations and bulk byte transfers. Used for the on-heap case.

        - `HybridMemorySegment`: Transparent on-heap/off-heap single-byte operations and bulk transfers, slightly slower than the `HeapMemorySegment`. Used for the off-heap case; it can also represent the unpooled on-heap memory at the same time.
    
      Effectively, some off-heap memory operations are a bit slower than on-heap operations. I think that cannot be avoided, and with this implementation we have a way to keep peak performance of managed memory on the heap.
      
    5. For optimal JIT efficiency, many methods are marked as final. In addition, performance increases if only one of the MemorySegment specializations is used at a time, and not both are mixed. That way the JIT can optimally de-virtualize and inline methods. Otherwise, it has to switch between optimization/deoptimization or use bimorphic inlining. Since the memory segment accesses are the absolute innermost loops, we care about these last few CPU cycles. The benchmarks below illustrate this.
    
      Direct instantiation of the memory segments is therefore not easily possible; instead, access goes through a `MemorySegmentFactory` that manages which implementation to load and use (see the sketches after this list).
    
    6. The implementations of the individual memory segment functions take care of the following:
        - Use methods that are JIT intrinsics, i.e., get jitted to optimized native instructions.
        - Use collapsed checks for range checks and disposal checks.
        - Avoid branches between heap/off-heap cases; instead, use a uniform code path (for primitive type accesses).
        - For single-element operations, the code path for the valid (non-exception) case is always the first option at a branch.
        - As long as the memory segment object exists, all operations are guaranteed to work and never lead to segmentation faults, even if the segment has been concurrently freed.
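
    As an illustration of the "collapsed checks" idea in point 6, here is a hedged sketch (simplified; not the exact code in this pull request, and the field and class names are only illustrative) of how a multi-byte accessor can cover the bounds check and the "segment already freed" check with a single comparison on the hot path:

    ```java
    import sun.misc.Unsafe;

    // Hedged sketch of the "collapsed checks" idea -- names are illustrative, not the PR's code.
    abstract class MemorySegmentSketch {

        private static final Unsafe UNSAFE = getUnsafeInstance();

        byte[] heapMemory;   // backing array for heap segments, null for off-heap segments
        long address;        // array base offset for heap memory, or the native address
        long addressLimit;   // address + segment size

        MemorySegmentSketch(byte[] heapMemory, long address, long size) {
            this.heapMemory = heapMemory;
            this.address = address;
            this.addressLimit = address + size;
        }

        public final long getLong(int index) {
            final long pos = address + index;
            if (index >= 0 && pos <= addressLimit - 8) {
                // Unsafe treats (array, offset) and (null, nativeAddress) uniformly,
                // so there is no heap/off-heap branch on this hot path.
                return UNSAFE.getLong(heapMemory, pos);
            }
            else if (address > addressLimit) {
                // free() invalidates the address, so the first comparison already failed;
                // this cold-path branch only picks the right exception type
                throw new IllegalStateException("segment has been freed");
            }
            else {
                throw new IndexOutOfBoundsException("index: " + index);
            }
        }

        public final void free() {
            // moving 'address' past 'addressLimit' makes every further access fall into
            // the exception branches above; releasing the memory itself happens elsewhere
            address = addressLimit + 1;
        }

        private static Unsafe getUnsafeInstance() {
            try {
                java.lang.reflect.Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            }
            catch (Exception e) {
                throw new RuntimeException("cannot access sun.misc.Unsafe", e);
            }
        }
    }
    ```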
    
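    A second hedged sketch shows the factory-based access pattern from points 4 and 5; the method names follow the description above but may differ from the final API:

    ```java
    import org.apache.flink.core.memory.MemorySegment;
    import org.apache.flink.core.memory.MemorySegmentFactory;

    public class FactoryUsageSketch {

        public static void main(String[] args) {
            // Code never instantiates HeapMemorySegment / HybridMemorySegment directly;
            // going through the factory ensures that at most one subclass is ever loaded,
            // so the JIT can de-virtualize and inline the MemorySegment methods.
            MemorySegment segment = MemorySegmentFactory.allocateUnpooledSegment(32 * 1024);

            segment.putLong(0, 42L);          // multi-byte accesses are uniform on heap and off-heap
            long value = segment.getLong(0);

            segment.free();                   // further accesses fail with an exception,
                                              // never with a segmentation fault
            System.out.println(value);
        }
    }
    ```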
    
    ## Tests 
    
    More than 6k lines of tests in total, covering all types of memory segments and memory (heap, hybrid-on-heap, hybrid-off-heap) and the interoperability between them. The tests include regular operation, accesses to released memory, bounds violations, interoperability with other heap and off-heap memory, and checks for the correct exception types.
    
    Test coverage reports show that virtually all conditionals are tested, except for the ones that are effective only on big-endian processor architectures and are otherwise optimized away by class loading/JIT.
    
    
    ## Open issues
    
      - The TaskManager startup script needs to be adjusted to interpret the `taskmanager.memory.off-heap` config entry and adjust the heap size and off-heap size according to `taskmanager.memory.size` and `taskmanager.memory.off-heap-ratio`.
      - Has not yet been integrated with the YarnTaskManagerRunner.
      - Needs some cluster testing
    
    ## Microbenchmarks of the Memory Segments
    
    These microbenchmarks test the performance of the memory segments in various operations.
    The code is under `flink-core/src/test/java/org/apache/flink/core/memory/benchmarks/MemorySegmentSpeedBenchmark.java`
    
      - Oracle Java 8 (1.8.0_25)
      - 4GB heap (the experiments need 2.2 GB)
      - Intel Core i7-4700MQ CPU, 2.40GHz (4 cores, 8 hardware contexts)
    
      - All experiments were run 5x, discarding the fastest and slowest run, and then averaged. This compensates for the delay before the JIT kicks in.
    
      - Each run tests the different implementations in different orders, to balance the advantage/disadvantage of the JIT specializing towards certain code paths.
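
    To make the tables below easier to interpret, here is a hedged sketch of the general shape of such a timed loop (the actual benchmark code lives in the file referenced above and differs in its details):

    ```java
    import org.apache.flink.core.memory.MemorySegment;

    // Hedged sketch of one benchmark round in the spirit of
    // "Writing 100000 x 32768 bytes to 32768 bytes segment" -- not the actual benchmark code.
    public class ByteWriteBenchmarkSketch {

        static long timeByteWrites(MemorySegment segment, int rounds) {
            final int size = segment.size();
            final long start = System.nanoTime();
            for (int r = 0; r < rounds; r++) {
                for (int i = 0; i < size; i++) {
                    segment.put(i, (byte) i);   // the single-byte write is the measured hot operation
                }
            }
            return (System.nanoTime() - start) / 1_000_000;   // elapsed milliseconds
        }
    }
    ```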
    
    The tested implementations are
    
      - `HeapMemorySegment` (exclusive): The case where it is the only loaded MemorySegment class.
      - `HeapMemorySegment` (mixed): The case where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `HybridMemorySegment` (heap-exclusive): Backed by heap memory, and the case where it is the only loaded MemorySegment class.
      - `HybridMemorySegment` (heap-mixed): Backed by heap memory, and the case where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `HybridMemorySegment` (off-heap-exclusive): Backed by off-heap memory, and the case where it is the only loaded MemorySegment class.
      - `HybridMemorySegment` (off-heap-mixed): Backed by off-heap memory, and the case where both the `HeapMemorySegment` and the `HybridMemorySegment` are loaded.
      - `PureHeapSegment`, which has no class hierarchy or virtual methods at all.
      - `PureHybridSegment` (heap), which has no class hierarchy or virtual methods at all, backed by heap memory.
      - `PureHybridSegment` (off-heap), which has no class hierarchy or virtual methods at all, backed by off-heap memory.
    
    
    A summary of the experiments:
      - The hybrid memory segment performs equally well in heap and off-heap memory
      - The heap memory segments are quite a bit better at reading individual bytes, but not much better at writing them.
      - The abstract class MemorySegment (with its subclasses `HeapMemorySegment` and `HybridMemorySegment`) performs as well as any specialized non-abstract class, as long as only one subclass is loaded. When both are loaded, performance may suffer by a factor of more than 2x on certain operations.
      - How badly performance degrades seems to depend a lot on the order in which the subclasses are exercised; sometimes performance is affected more than at other times. This motivates the design decision to load at most one subclass of the memory segment, to get the best and most predictable performance.
    
      
    ### Byte accesses
    
    **Writing 100000 x 32768 bytes to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 1,441 msecs
    `HeapMemorySegment`, mixed                 | 3,841 msecs
    `HybridMemorySegment`, heap, exclusive     | 1,626 msecs
    `HybridMemorySegment`, off-heap, exclusive | 1,628 msecs
    `HybridMemorySegment`, heap, mixed         | 3,848 msecs
    `HybridMemorySegment`, off-heap, mixed     | 3,847 msecs
    `PureHeapSegment`                          | 1,442 msecs
    `PureHybridSegment`, heap                  | 1,623 msecs
    `PureHybridSegment`, off-heap              | 1,620 msecs
    
    **Reading 100000 x 32768 bytes from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 1,326 msecs
    `HeapMemorySegment`, mixed                 | 1,378 msecs
    `HybridMemorySegment`, heap, exclusive     | 2,029 msecs
    `HybridMemorySegment`, off-heap, exclusive | 2,030 msecs
    `HybridMemorySegment`, heap, mixed         | 2,047 msecs
    `HybridMemorySegment`, off-heap, mixed     | 2,049 msecs
    `PureHeapSegment`                          | 1,331 msecs
    `PureHybridSegment`, heap                  | 2,030 msecs
    `PureHybridSegment`, off-heap              | 2,030 msecs
    
    **Writing 10 x 1073741824 bytes to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 5,602 msecs
    `HeapMemorySegment`, mixed                 | 12,570 msecs
    `HybridMemorySegment`, heap, exclusive     | 5,391 msecs
    `HybridMemorySegment`, off-heap, exclusive | 5,391 msecs
    `HybridMemorySegment`, heap, mixed         | 12,566 msecs
    `HybridMemorySegment`, off-heap, mixed     | 12,556 msecs
    `PureHeapSegment`                          | 5,599 msecs
    `PureHybridSegment`, heap                  | 5,387 msecs
    `PureHybridSegment`, off-heap              | 5,381 msecs
    
    **Reading 10 x 1073741824 bytes from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, exclusive             | 4,243 msecs
    `HeapMemorySegment`, mixed                 | 4,265 msecs
    `HybridMemorySegment`, heap, exclusive     | 6,730 msecs
    `HybridMemorySegment`, off-heap, exclusive | 6,725 msecs
    `HybridMemorySegment`, heap, mixed         | 6,933 msecs
    `HybridMemorySegment`, off-heap, mixed     | 6,926 msecs
    `PureHeapSegment`                          | 4,247 msecs
    `PureHybridSegment`, heap                  | 6,919 msecs
    `PureHybridSegment`, off-heap              | 6,916 msecs
    
    ### Byte Array accesses
    
    **Writing 100000 x 32 byte[1024] to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 164 msecs
    `HybridMemorySegment`, heap, mixed         | 163 msecs
    `HybridMemorySegment`, off-heap, mixed     | 163 msecs
    `PureHeapSegment`                          | 165 msecs
    `PureHybridSegment`, heap                  | 182 msecs
    `PureHybridSegment`, off-heap              | 176 msecs
    
    **Reading 100000 x 32 byte[1024] from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 157 msecs
    `HybridMemorySegment`, heap, mixed         | 155 msecs
    `HybridMemorySegment`, off-heap, mixed     | 162 msecs
    `PureHeapSegment`                          | 161 msecs
    `PureHybridSegment`, heap                  | 175 msecs
    `PureHybridSegment`, off-heap              | 179 msecs
    
    **Writing 10 x 1048576 byte[1024] to 1073741824 bytes segment** 
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,164 msecs
    `HybridMemorySegment`, heap, mixed         | 1,173 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,157 msecs
    `PureHeapSegment`                          | 1,169 msecs
    `PureHybridSegment`, heap                  | 1,174 msecs
    `PureHybridSegment`, off-heap              | 1,166 msecs
    
    **Reading 10 x 1048576 byte[1024] from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 854 msecs
    `HybridMemorySegment`, heap, mixed         | 853 msecs
    `HybridMemorySegment`, off-heap, mixed     | 854 msecs
    `PureHeapSegment`                          | 857 msecs
    `PureHybridSegment`, heap                  | 896 msecs
    `PureHybridSegment`, off-heap              | 887 msecs
    
    
    ### Long integer accesses
    
    *(note that the heap and off-heap segments use the same or comparable code for this)*
    
    **Writing 100000 x 4096 longs to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 221 msecs
    `HybridMemorySegment`, heap, mixed         | 222 msecs
    `HybridMemorySegment`, off-heap, mixed     | 221 msecs
    `PureHeapSegment`                          | 194 msecs
    `PureHybridSegment`, heap                  | 220 msecs
    `PureHybridSegment`, off-heap              | 221 msecs
    
    **Reading 100000 x 4096 longs from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 233 msecs
    `HybridMemorySegment`, heap, mixed         | 232 msecs
    `HybridMemorySegment`, off-heap, mixed     | 231 msecs
    `PureHeapSegment`                          | 232 msecs
    `PureHybridSegment`, heap                  | 232 msecs
    `PureHybridSegment`, off-heap              | 233 msecs
    
    **Writing 10 x 134217728 longs to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,120 msecs
    `HybridMemorySegment`, heap, mixed         | 1,120 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,115 msecs
    `PureHeapSegment`                          | 1,148 msecs
    `PureHybridSegment`, heap                  | 1,116 msecs
    `PureHybridSegment`, off-heap              | 1,113 msecs
    
    **Reading 10 x 134217728 longs from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,097 msecs
    `HybridMemorySegment`, heap, mixed         | 1,099 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,093 msecs
    `PureHeapSegment`                          | 917 msecs
    `PureHybridSegment`, heap                  | 1,105 msecs
    `PureHybridSegment`, off-heap              | 1,097 msecs
    
    
    ### Integer accesses
    
    *(note that the heap and off-heap segments use the same or comparable code for this)*
    
    **Writing 100000 x 8192 ints to 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 578 msecs
    `HybridMemorySegment`, heap, mixed         | 580 msecs
    `HybridMemorySegment`, off-heap, mixed     | 576 msecs
    `PureHeapSegment`                          | 624 msecs
    `PureHybridSegment`, heap                  | 576 msecs
    `PureHybridSegment`, off-heap              | 578 msecs
    
    **Reading 100000 x 8192 ints from 32768 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 464 msecs
    `HybridMemorySegment`, heap, mixed         | 464 msecs
    `HybridMemorySegment`, off-heap, mixed     | 465 msecs
    `PureHeapSegment`                          | 463 msecs
    `PureHybridSegment`, heap                  | 464 msecs
    `PureHybridSegment`, off-heap              | 463 msecs
    
    **Writing 10 x 268435456 ints to 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 2,187 msecs
    `HybridMemorySegment`, heap, mixed         | 2,161 msecs
    `HybridMemorySegment`, off-heap, mixed     | 2,152 msecs
    `PureHeapSegment`                          | 2,770 msecs
    `PureHybridSegment`, heap                  | 2,161 msecs
    `PureHybridSegment`, off-heap              | 2,157 msecs
    
    **Reading 10 x 268435456 ints from 1073741824 bytes segment**
    
    Segment                                    | Time
    ------------------------------------------ | -------------
    `HeapMemorySegment`, mixed                 | 1,782 msecs
    `HybridMemorySegment`, heap, mixed         | 1,783 msecs
    `HybridMemorySegment`, off-heap, mixed     | 1,774 msecs
    `PureHeapSegment`                          | 1,501 msecs
    `PureHybridSegment`, heap                  | 1,774 msecs
    `PureHybridSegment`, off-heap              | 1,771 msecs
    
    
    
    There is still a bit of mystery left, specifically why code is sometimes faster when it performs more checks (has more instructions and an additional branch). Even though the branch is perfectly predictable, this seems counter-intuitive. The only explanation I could come up with is that the JIT does a better register allocation in that case (for whatever reason; maybe the instructions just fit the algorithm better).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/StephanEwen/incubator-flink mem_mixed

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1093.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1093
    
----
commit 2c359d7cb83daa990e4fe47c69c6a8bae018e534
Author: Stephan Ewen <se...@apache.org>
Date:   2015-08-30T20:36:46Z

    [FLINK-1320] [core] Add an off-heap variant of the managed memory
    
    This closes #290

----


> Add an off-heap variant of the managed memory
> ---------------------------------------------
>
>                 Key: FLINK-1320
>                 URL: https://issues.apache.org/jira/browse/FLINK-1320
>             Project: Flink
>          Issue Type: Improvement
>          Components: Local Runtime
>            Reporter: Stephan Ewen
>            Assignee: niraj rai
>            Priority: Minor
>
> For (nearly) all memory that Flink accumulates (in the form of sort buffers, hash tables, caching), we use a special way of representing data serialized across a set of memory pages. The big work lies in the way the algorithms are implemented to operate on pages, rather than on objects.
> The core class for the memory is the {{MemorySegment}}, which has all methods to set and get primitive values efficiently. It is a somewhat simpler (and faster) variant of a HeapByteBuffer.
> As such, it should be straightforward to create a version where the memory segment is not backed by a heap byte[], but by memory allocated outside the JVM, in a similar way as the NIO DirectByteBuffers, or the Netty direct buffers do it.
> This may have multiple advantages:
>   - We reduce the size of the JVM heap (garbage collected) and the number and size of long-lived objects. For large JVM sizes, this may improve performance quite a bit. Ultimately, we would in many cases reduce the JVM heap size to 1/3 to 1/2 and keep the remaining memory outside the JVM.
>   - We save copies when we move memory pages to disk (spilling) or through the network (shuffling / broadcasting / forward piping)
> The changes required to implement this are
>   - Add an {{UnmanagedMemorySegment}} that only stores the memory address as a long, and the segment size. It is initialized from a DirectByteBuffer.
>   - Allow the MemoryManager to allocate these MemorySegments, instead of the current ones.
>   - Make sure that the startup scripts pick up the mode and configure the heap size and the max direct memory properly.
> Since the MemorySegment is probably the most performance critical class in Flink, we must take care that we do this right. The following are critical considerations:
>   - If we want both solutions (heap and off-heap) to exist side-by-side (configurable), we must make the base MemorySegment abstract and implement two versions (heap and off-heap).
>   - To get the best performance, we need to make sure that only one class gets loaded (or at least ever used), to ensure optimal JIT de-virtualization and inlining.
>   - We should carefully measure the performance of both variants. From previous micro benchmarks, I remember that individual byte accesses in DirectByteBuffers (off-heap) were slightly slower than on-heap, while larger accesses were equally fast or slightly faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)