Posted to issues@flink.apache.org by "Andrey Zagrebin (Jira)" <ji...@apache.org> on 2020/01/30 12:06:00 UTC
[jira] [Comment Edited] (FLINK-15758) Investigate potential out-of-memory problems due to managed unsafe memory allocation

    [ https://issues.apache.org/jira/browse/FLINK-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026626#comment-17026626 ] 

Andrey Zagrebin edited comment on FLINK-15758 at 1/30/20 12:05 PM:
-------------------------------------------------------------------

After looking more into this topic, the problem could be resolved with the following steps:
 * Simplify MemoryManager by removing KeyedBudgetManager, as the managed memory is only off-heap now
 * Implement custom unsafe memory control, similar to what the JVM does to enforce the direct memory limit:
 ** Use an AtomicLong to track allocated memory, with optimistic allocation retry (like in nio.Bits#tryReserveMemory)
 ** Try to speed up running GC phantom ref cleaners with SharedSecrets.getJavaLangRefAccess and fall back to a full GC if allocation fails, before throwing OutOfMemoryError (like in nio.Bits#reserveMemory)
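
A minimal sketch of the counter-based control described in the list above. The class UnsafeMemoryBudget, its method names, the retry count and the backoff values are all hypothetical and chosen for illustration; this is not Flink's implementation. It only loosely follows the shape of nio.Bits: a compare-and-set loop for the optimistic reservation, then a full GC with a few retries before giving up. It assumes that cleaners registered elsewhere call release() once the memory is actually freed.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical budget tracker for unsafe memory; names and constants are illustrative. */
final class UnsafeMemoryBudget {
    private static final int MAX_GC_RETRIES = 5; // assumption, not taken from the JDK or Flink

    private final long capacity;
    private final AtomicLong reserved = new AtomicLong();

    UnsafeMemoryBudget(long capacity) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be non-negative");
        }
        this.capacity = capacity;
    }

    /** Optimistic reservation: retry the CAS until it succeeds or the budget is exhausted. */
    boolean tryReserve(long size) {
        if (size < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        long current;
        while (size <= capacity - (current = reserved.get())) {
            if (reserved.compareAndSet(current, current + size)) {
                return true;
            }
        }
        return false;
    }

    /** Reserves memory, triggering a full GC and backing off before failing. */
    void reserve(long size) {
        if (tryReserve(size)) {
            return;
        }
        boolean interrupted = false;
        try {
            long sleepMillis = 1;
            for (int attempt = 0; attempt < MAX_GC_RETRIES; attempt++) {
                // Ask the JVM to collect unreachable owners so that their cleaners
                // (registered elsewhere) get a chance to call release().
                System.gc();
                if (tryReserve(size)) {
                    return;
                }
                try {
                    Thread.sleep(sleepMillis);
                    sleepMillis <<= 1;
                } catch (InterruptedException e) {
                    interrupted = true;
                }
            }
            throw new OutOfMemoryError("Could not reserve " + size + " bytes of unsafe memory");
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Returns previously reserved bytes to the budget. */
    void release(long size) {
        reserved.addAndGet(-size);
    }

    long reservedBytes() {
        return reserved.get();
    }
}
```

The counter is updated before the actual allocation, so a failed reservation never touches native memory; the caller is expected to call release() if the subsequent allocation itself fails.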

The last step will require wrapping the JavaLangRefAccess logic in reflection calls, as it was relocated in Java 9 and its API has changed.
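
A rough sketch of what such a reflection wrapper could look like. The helper class PendingReferenceProcessor is hypothetical. The candidate class and method names below are assumptions about JDK internals, to the best of my knowledge, and should be verified against the target JDK. On Java 9+ the internal package is normally not exported, so the call is expected to fail unless the JVM is started with suitable --add-exports/--add-opens flags; the sketch then simply reports that nothing was processed.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical helper; probes JVM-internal classes that may be absent or inaccessible. */
final class PendingReferenceProcessor {
    // Assumed locations of SharedSecrets, depending on the JVM version.
    private static final String[] SHARED_SECRETS_CLASSES = {
        "sun.misc.SharedSecrets",            // Java 8
        "jdk.internal.misc.SharedSecrets",   // Java 9 to 11
        "jdk.internal.access.SharedSecrets"  // Java 12 and later
    };

    // Assumed names of the JavaLangRefAccess method, depending on the JVM version.
    private static final String[] REF_ACCESS_METHODS = {
        "tryHandlePendingReference",   // Java 8
        "waitForReferenceProcessing"   // Java 9 and later
    };

    private PendingReferenceProcessor() {}

    /**
     * Returns true if the JVM reported that pending references were processed,
     * false if there were none or the internals could not be reached.
     */
    static boolean tryRunPendingCleaners() {
        for (String className : SHARED_SECRETS_CLASSES) {
            try {
                Method getter = Class.forName(className).getMethod("getJavaLangRefAccess");
                Object refAccess = getter.invoke(null);
                // Look the method up on the declared interface, not on the hidden implementation class.
                Class<?> refAccessType = getter.getReturnType();
                for (String methodName : REF_ACCESS_METHODS) {
                    try {
                        Object result = refAccessType.getMethod(methodName).invoke(refAccess);
                        return Boolean.TRUE.equals(result);
                    } catch (NoSuchMethodException e) {
                        // This JVM uses the other method name; try the next candidate.
                    }
                }
            } catch (InvocationTargetException e) {
                if (e.getCause() instanceof InterruptedException) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            } catch (ReflectiveOperationException | RuntimeException | LinkageError e) {
                // Class missing or access denied on this JVM; try the next candidate.
            }
        }
        return false;
    }
}
```

A caller would typically loop on tryRunPendingCleaners() while it returns true, retrying the reservation after each round, and only then fall back to a full GC.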


was (Author: azagrebin):
After looking more into this topic, the problem could be resolved with the following steps:
 * Simplify MemoryManager by removing KeyedBudgetManager, as the managed memory is only off-heap now
 * Implement custom unsafe memory control, similar to what the JVM does to enforce the direct memory limit:
 ** Use an AtomicLong to track allocated memory, with optimistic allocation retry (like in nio.Bits#tryReserveMemory)
 ** Try to speed up running GC phantom ref cleaners with SharedSecrets.getJavaLangRefAccess and fall back to a full GC if allocation fails (like in nio.Bits#reserveMemory)

> Investigate potential out-of-memory problems due to managed unsafe memory allocation
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-15758
>                 URL: https://issues.apache.org/jira/browse/FLINK-15758
>             Project: Flink
>          Issue Type: Task
>          Components: API / DataSet, Runtime / Task
>            Reporter: Andrey Zagrebin
>            Assignee: Andrey Zagrebin
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> After FLINK-13985, managed memory is allocated from UNSAFE, not as direct nio buffers as before 1.10.
> In FLINK-14894, there was an attempt to release this memory only when all Java handles of the unsafe memory are about to be GC'ed. This is similar to how it worked with direct nio buffers before 1.10, but the unsafe memory is not tracked by the direct memory limit (-XX:MaxDirectMemorySize). The problem is that over-allocating unsafe memory will not hit the direct memory limit and therefore will not immediately trigger GC, which is the only way to release it. In this case, allocation fails with out-of-memory errors without triggering GC to release a lot of potentially already unused memory.
> We have to investigate further optimisations, like:
>  * directly monitoring the phantom reference queue of the cleaner (if the JVM detects quickly that there are no more references to the memory) and explicitly releasing memory that is ready for GC as soon as possible, e.g. after Task exit
>  * monitoring the allocated memory amount and blocking allocation until GC releases occupied memory, instead of failing with out-of-memory immediately
> cc [~sewen] [~trohrmann]
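
For the first optimisation listed in the description (monitoring the cleaner's phantom reference queue), a minimal sketch using only the public java.lang.ref API could look as follows. The class SegmentCleanerRegistry and its methods are hypothetical names for illustration, not existing Flink code. The release action must not hold a reference to the owner object, otherwise the owner never becomes unreachable.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry that releases unsafe memory once its owner is found unreachable. */
final class SegmentCleanerRegistry {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps the phantom references themselves strongly reachable until they are processed.
    private final Set<CleanerRef> live = ConcurrentHashMap.newKeySet();

    private static final class CleanerRef extends PhantomReference<Object> {
        private final Runnable releaseAction;

        CleanerRef(Object owner, ReferenceQueue<Object> queue, Runnable releaseAction) {
            super(owner, queue);
            this.releaseAction = releaseAction;
        }
    }

    /** Registers a release action to run after the owner has been garbage collected. */
    void register(Object owner, Runnable releaseAction) {
        live.add(new CleanerRef(owner, queue, releaseAction));
    }

    /**
     * Non-blocking: runs the release actions of all owners the GC has already
     * enqueued and returns how many were released. Could be called, for example,
     * after Task exit or before an allocation is rejected.
     */
    int releaseCollected() {
        int released = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            CleanerRef cleaner = (CleanerRef) ref;
            if (live.remove(cleaner)) {
                cleaner.releaseAction.run();
                released++;
            }
        }
        return released;
    }

    /** Number of registered owners whose memory has not been released yet. */
    int pending() {
        return live.size();
    }
}
```

Polling the queue explicitly avoids waiting for a background cleaner thread, but it still depends on the GC having noticed that the owner is unreachable.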



--
This message was sent by Atlassian Jira
(v8.3.4#803005)