Posted to issues@flink.apache.org by "Andrey Zagrebin (Jira)" <ji...@apache.org> on 2020/01/24 18:21:00 UTC

[jira] [Comment Edited] (FLINK-14894) HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap failed on Travis

    [ https://issues.apache.org/jira/browse/FLINK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17023015#comment-17023015 ] 

Andrey Zagrebin edited comment on FLINK-14894 at 1/24/20 6:20 PM:
------------------------------------------------------------------

[~sewen], [~trohrmann] and I had an offline discussion.

The conclusion at the moment is that releasing unsafe memory while Java code may still hold references to it is dangerous. We will revert this and rely only on GC, which releases the memory once no references to it remain in Java code. The problem can occur, for example, if the task thread exits without joining its IO threads (e.g. spilling in a batch job): the unsafe memory is then released, but an IO thread can still write to it without a segfault. Meanwhile, another task can allocate memory that overlaps the released region, and that IO thread can corrupt it. We still keep the memory unsafe, so that it is allocated outside the JVM direct memory limit and does not interfere with direct allocations. It also does not make sense for RocksDB native memory (which is also accounted for in MemoryManager) to be part of the direct memory limit.
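
The GC-only release described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Flink's actual implementation: it uses java.lang.ref.Cleaner (Java 9+), the class name GcReleasedSegment is hypothetical, and a plain counter stands in for the real unsafe allocation.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a segment whose backing memory is released only by GC.
final class GcReleasedSegment {
    // Assumption: this counter stands in for bytes obtained from an unsafe allocator.
    private static final AtomicLong ALLOCATED = new AtomicLong();
    private static final Cleaner CLEANER = Cleaner.create();

    // The release action must not reference the segment itself,
    // otherwise the segment could never become unreachable.
    private static final class Release implements Runnable {
        private final long size;

        Release(long size) {
            this.size = size;
        }

        @Override
        public void run() {
            // A real implementation would free the native memory here.
            ALLOCATED.addAndGet(-size);
        }
    }

    private final long size;

    GcReleasedSegment(long size) {
        this.size = size;
        ALLOCATED.addAndGet(size);
        // No explicit free() is exposed: the release action runs only after
        // the segment has become unreachable, so no thread can still write to it.
        CLEANER.register(this, new Release(size));
    }

    long size() {
        return size;
    }

    static long allocatedBytes() {
        return ALLOCATED.get();
    }
}
```

As long as any thread (for example an IO thread that outlives the task thread) holds a reference to the segment, the release action cannot run. The cost is that the time of release is decided by GC, not by the task.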

The potential downside is that over-allocating unsafe memory will not hit the direct memory limit and therefore will not immediately trigger GC, which will be the only way to release it. In that case, allocation can fail with out-of-memory errors without a GC having been triggered to release a lot of memory that is potentially already unused.

If we see the delayed release as a problem, we can investigate further optimisations (FLINK-15758), such as:
 * directly monitoring the phantom reference queue of the cleaner (if the JVM quickly detects that there are no more references to the memory) and explicitly releasing memory that is ready for GC as soon as possible, e.g. after Task exit
 * monitoring the amount of allocated memory and blocking allocation until GC releases occupied memory, instead of failing with out-of-memory immediately
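
The second idea (blocking allocation instead of failing immediately) could look roughly like the sketch below. It is only an illustration: the class BlockingMemoryBudget, its methods and the timeout parameter are hypothetical and not part of Flink, and System.gc() is merely a hint to the JVM.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: tracks reserved bytes and blocks when the limit is reached.
final class BlockingMemoryBudget {
    private final long limit;
    private long used;

    BlockingMemoryBudget(long limit) {
        this.limit = limit;
    }

    /**
     * Reserves the given number of bytes, waiting up to timeoutMillis for other
     * memory to be released. Returns false on timeout or interruption.
     */
    synchronized boolean reserve(long bytes, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (used + bytes > limit) {
            long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMillis <= 0) {
                return false; // give up only after the timeout, not immediately
            }
            System.gc(); // hint only: ask the JVM to collect unreachable segments
            try {
                wait(remainingMillis); // woken up by release()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        used += bytes;
        return true;
    }

    /** Called when memory is actually freed, e.g. from a GC-driven cleanup action. */
    synchronized void release(long bytes) {
        used -= bytes;
        notifyAll();
    }

    synchronized long used() {
        return used;
    }
}
```

In the setting discussed here, release() would be invoked by the GC-driven cleanup of a segment, so a caller of reserve() waits for GC to catch up and does not fail on the first attempt.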


> HybridOffHeapUnsafeMemorySegmentTest#testByteBufferWrap failed on Travis
> ------------------------------------------------------------------------
>
>                 Key: FLINK-14894
>                 URL: https://issues.apache.org/jira/browse/FLINK-14894
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.10.0
>            Reporter: Gary Yao
>            Assignee: Andrey Zagrebin
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> {noformat}
> HybridOffHeapUnsafeMemorySegmentTest>MemorySegmentTestBase.testByteBufferWrapping:2465 expected:<992288337> but was:<196608>
> {noformat}
> https://api.travis-ci.com/v3/job/258950527/log.txt


