Posted to issues@flink.apache.org by NicoK <gi...@git.apache.org> on 2017/08/01 08:41:03 UTC

[GitHub] flink pull request #4445: [FLINK-7310][core] always use the HybridMemorySegm...

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/4445

    [FLINK-7310][core] always use the HybridMemorySegment

    ## What is the purpose of the change
    
    Since we'd like to use our own off-heap buffers for network communication, we cannot use `HeapMemorySegment` anymore and need to rely on `HybridMemorySegment`. We thus drop any code that loads the `HeapMemorySegment` (it is still available if needed) in favour of the `HybridMemorySegment` which is able to work on both heap and off-heap memory.
        
    For the performance penalty of this change compared to using `HeapMemorySegment` alone, see this interesting blog article (from 2015):
    https://flink.apache.org/news/2015/09/16/off-heap-memory.html
    
    ## Brief change log
    
      - drop any use of the `HeapMemorySegment` (however, for now, keep the class and its factory)
      - integrate `HybridMemorySegmentFactory` into `MemorySegmentFactory` (with hard-coded use of `HybridMemorySegment`)
    
    ## Verifying this change
    
    This change is already covered by existing tests, such as the memory-backend-specific tests under `flink/core/memory` and, in effect, all other tests that run programs on Flink. So far, the `HybridMemorySegment` was hardly exercised in integration tests because most tests used on-heap memory and thus `HeapMemorySegment`. Since we now only use `HybridMemorySegment`, this change effectively adds a lot of test coverage for it.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (yes)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
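
As a rough illustration of the idea described above (one segment type that works on both heap and off-heap memory, created through a single factory path), here is a minimal sketch. It is not Flink's code: `SketchSegment`, `wrapHeap`, and `allocateOffHeap` are hypothetical names, and it assumes a plain `java.nio.ByteBuffer` as the backing store, whereas the real classes may access memory differently.

```java
import java.nio.ByteBuffer;

/** Hypothetical, simplified stand-in for a segment that can sit on heap or off-heap memory. */
final class SketchSegment {
    // Either a heap buffer wrapping a byte[] or a direct (off-heap) buffer.
    private final ByteBuffer buffer;

    private SketchSegment(ByteBuffer buffer) {
        this.buffer = buffer;
    }

    /** Factory path for on-heap memory: wraps an existing byte array without copying it. */
    static SketchSegment wrapHeap(byte[] memory) {
        return new SketchSegment(ByteBuffer.wrap(memory));
    }

    /** Factory path for off-heap memory: allocates a direct buffer of the given size. */
    static SketchSegment allocateOffHeap(int size) {
        return new SketchSegment(ByteBuffer.allocateDirect(size));
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    int size() {
        return buffer.capacity();
    }

    // Absolute accessors: callers use the same code for both kinds of memory.
    void putInt(int index, int value) {
        buffer.putInt(index, value);
    }

    int getInt(int index) {
        return buffer.getInt(index);
    }
}
```

With this shape, calling code such as `SketchSegment.wrapHeap(new byte[16]).putInt(4, 42)` and `SketchSegment.allocateOffHeap(16).putInt(4, 42)` goes through the same accessors regardless of where the memory lives, which is the property the change relies on.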
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink flink-7310

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4445.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4445
    
----
commit c62e793712effbfb53ea6442b5d714a68081f7ec
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-31T10:06:14Z

    [hotfix] fix some typos

commit d3c4e231a96b6ae133576a74646294749ab3809a
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-07-31T12:18:42Z

    [FLINK-7310][core] always use the HybridMemorySegment
    
    Since we'd like to use our own off-heap buffers for network communication, we
    cannot use HeapMemorySegment anymore and need to rely on HybridMemorySegment.
    We thus drop any code that loads the HeapMemorySegment (it is still available
    if needed) in favour of the HybridMemorySegment which is able to work on both
    heap and off-heap memory.
    
    For the performance penalty of this change compared to using HeapMemorySegment
    alone, see this interesting blog article (from 2015):
    https://flink.apache.org/news/2015/09/16/off-heap-memory.html

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    in a non-exhaustive mini benchmark, I ran `HashVsSortMiniBenchmark` and got the following results:
    
    # Best out of 5 (in ms)
    
    Test | `master` | `FLINK-7310`
    ---- | ------ | ----------
    Hash Build First | 5541 | 5629
    Sort-Merge | 6194 | 6816
    Hash Build Second | 3587 | 3629
    
    # All results
    
    ## `master`
    
    Test | 1 | 2 | 3 | 4 | 5
    ---- | - | - | - | - | -
    Hash Build First | 5772.0 | 5541.0 | 5707.0 | 5733.0 | 5751.0
    Sort-Merge | 6704.0 | 7146.0 | 6194.0 | 6915.0 | 6445.0
    Hash Build Second | 3834.0 | 3805.0 | 3811.0 | 3587.0 | 3563.0
    
    ## `FLINK-7310`
    
    Test | 1 | 2 | 3 | 4 | 5
    ---- | - | - | - | - | -
    Hash Build First | 5816.0 | 5770.0 | 5629.0 | 5656.0 | 5745.0
    Sort-Merge | 7284.0 | 7233.0 | 6816.0 | 6861.0 | 7218.0
    Hash Build Second | 3802.0 | 3836.0 | 3629.0 | 3782.0 | 3804.0
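
To put the best-of-5 numbers above into perspective, the relative slowdown can be computed as (branch - master) / master. The following is a small sketch of that arithmetic; the class and method names are made up for illustration, and the inputs are the best-of-5 values exactly as reported in the table.

```java
import java.util.Locale;

/** Hypothetical helper: relative slowdown of a candidate run against a baseline run. */
final class RelativeSlowdown {

    /** Returns the slowdown in percent; positive means the candidate is slower. */
    static double percent(double baselineMs, double candidateMs) {
        return (candidateMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        // Best-of-5 values (ms) as reported: master vs. FLINK-7310.
        String[] tests = {"Hash Build First", "Sort-Merge", "Hash Build Second"};
        double[] master = {5541, 6194, 3587};
        double[] branch = {5629, 6816, 3629};
        for (int i = 0; i < tests.length; i++) {
            System.out.println(String.format(Locale.ROOT, "%s: %.1f %%",
                    tests[i], percent(master[i], branch[i])));
        }
    }
}
```

With the reported values this comes out at roughly 1.6 % for Hash Build First, 10.0 % for Sort-Merge, and 1.2 % for Hash Build Second, so the sort-merge case is where the difference is most visible in this mini benchmark.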


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    These changes look good to me!
    
    This change does have a potential performance impact, and it would be cool to understand how large it is now that only the HybridMemorySegment is used.
    We could run something like a Hash Join Performance test with key/value pairs of String keys (which are the most performance sensitive to serialize / deserialize with individual byte operations) and see if this has a measurable impact there.


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by StefanRRichter <gi...@git.apache.org>.
Github user StefanRRichter commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    I think the implementation of the change is good, but the performance impact seems noticeable, at least in some cases. I think the additional bounds checking in the hybrid case shows. Out of curiosity I deactivated the index bounds checks and this closed all gaps between `HeapMemorySegment` and `HybridMemorySegment` in the benchmarks that @NicoK mentioned.
    
    If @StephanEwen has no concerns about the performance regression, I think this could be merged.
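
To illustrate why per-access bounds checks can show up in a benchmark, here is a simplified sketch of a segment that addresses a region inside a larger block of memory. It is not Flink's implementation: `CheckedSegment` is a hypothetical class, and a shared `byte[]` stands in for raw memory. The point is only that when a segment computes its own offsets into shared memory, the underlying access does not know where the segment ends, so the segment has to check every index itself, and that check runs on every single read and write.

```java
/** Hypothetical segment covering the range [offset, offset + size) of a shared memory block. */
final class CheckedSegment {
    private final byte[] backing; // stands in for raw memory shared with other segments
    private final int offset;     // start of this segment inside the backing memory
    private final int size;       // number of bytes that belong to this segment

    CheckedSegment(byte[] backing, int offset, int size) {
        if (offset < 0 || size < 0 || offset > backing.length - size) {
            throw new IllegalArgumentException("segment does not fit into backing memory");
        }
        this.backing = backing;
        this.offset = offset;
        this.size = size;
    }

    byte get(int index) {
        checkIndex(index);
        return backing[offset + index];
    }

    void put(int index, byte value) {
        checkIndex(index);
        backing[offset + index] = value;
    }

    // Explicit per-access check: without it, an index >= size could silently
    // touch bytes that belong to a neighbouring segment in the same block.
    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

In this sketch, two segments created over the same 8-byte array at offsets 0 and 4 stay isolated only because of `checkIndex`; removing it would make `first.get(4)` return the first byte of the second segment instead of failing.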


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by KurtYoung <gi...@git.apache.org>.
Github user KurtYoung commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    I would bet on deserialization as the cause. The sorter regresses more than the hash join because it performs more deserializations when comparing records.
    
    Despite the regression, I think the change is still worthwhile since it lets us avoid an extra copy from network to runtime. It would be better if the benchmark took that extra copy into account, but it's OK that it doesn't.
    
    +1 to merge this.


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by NicoK <gi...@git.apache.org>.
Github user NicoK commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    FYI: I just rebased this PR onto current `master` to make it mergeable and to support further extensions.


---

[GitHub] flink pull request #4445: [FLINK-7310][core] always use the HybridMemorySegm...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/4445


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    Thanks!
    
    I am currently trying to pinpoint what part of the code exactly suffers most from the regression. If that is for example specific to the microbenchmark, we can merge this without concern...


---

[GitHub] flink issue #4445: [FLINK-7310][core] always use the HybridMemorySegment

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/4445
  
    Agree with @KurtYoung.
    Merging this...


---