Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/02/23 19:26:00 UTC

[jira] [Comment Edited] (FLINK-11205) Task Manager Metaspace Memory Leak

    [ https://issues.apache.org/jira/browse/FLINK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042986#comment-17042986 ] 

Stephan Ewen edited comment on FLINK-11205 at 2/23/20 7:25 PM:
---------------------------------------------------------------

Coming back to this issue (sorry for the delay). There is a similar discussion in FLINK-16142, but the root cause there seems to be a different one.

[~nawaidshamim] and [~fwiffo], is this still an issue for you, or have you found a way to solve it?

Some thoughts on how to approach this are:
* FLIP-49 configures a metaspace size limit by default, because metaspace cleanup does not seem to happen unless that limit is reached (see the configuration example at the end of this comment).
* As a final safety net, the TMs kill/restart themselves when the metaspace blows up: FLINK-16225
* [~fwiffo] I think the object stream class cache uses weak references, so it should not contribute to the class leak. But maybe that has not always been the case in all Java versions.
* [~fwiffo] A generic mechanism to prevent leaks through ClassLoader caching (as in Apache Commons Logging) would be FLINK-16245: use a delegating class loader where we drop the reference to the real one when closing it. A rough sketch of that idea follows right after this list.
* There are other cases where libraries produce class leaks; we just identified the AWS SDK as a culprit with its metric admin beans: FLINK-16142
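
To make the FLINK-16245 idea a bit more concrete, here is a minimal sketch of a delegating user-code class loader that drops the reference to the real class loader on {{close()}}. This is only an illustration of the approach, not Flink's actual implementation; the class and method names are made up.

{code:java}
import java.io.Closeable;
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

/** Illustration only: hands out classes via an inner loader and releases it on close(). */
public class ReleasableUserCodeClassLoader extends ClassLoader implements Closeable {

    // the "real" user code class loader; dropped on close()
    private volatile URLClassLoader delegate;

    public ReleasableUserCodeClassLoader(URL[] userJars, ClassLoader parent) {
        super(parent);
        this.delegate = new URLClassLoader(userJars, parent);
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        URLClassLoader d = delegate;
        if (d == null) {
            throw new ClassNotFoundException(name + " (user code class loader is closed)");
        }
        return d.loadClass(name);
    }

    @Override
    public void close() throws IOException {
        URLClassLoader d = delegate;
        // Drop the strong reference so the delegate and its classes can be unloaded,
        // even if something (a cache, an MBean, a logging framework) still holds
        // a reference to this outer loader.
        delegate = null;
        if (d != null) {
            d.close();
        }
    }
}
{code}

The trade-off is that anything trying to load a class through a leaked reference after the job finished fails fast with a {{ClassNotFoundException}}, instead of silently keeping all of the job's classes alive in metaspace.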

Also worth noting that this should only be relevant for "sessions" (clusters that accept dynamic job submission) and not for "single-application clusters", which do not use dynamic class loading at all.
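
For reference on the first point above: with FLIP-49 (Flink 1.10+), the metaspace limit that triggers cleanup is, if I remember the option name correctly, configured via {{taskmanager.memory.jvm-metaspace.size}} in {{flink-conf.yaml}}, which Flink turns into {{-XX:MaxMetaspaceSize}} on the TaskManager JVM. A minimal example (the value is illustrative, not a recommendation):

{code}
# flink-conf.yaml (FLIP-49 memory model, Flink 1.10+)
# Cap the TaskManager metaspace so that class unloading is actually triggered
# and a real leak surfaces as a clear metaspace OOM instead of unbounded growth.
taskmanager.memory.jvm-metaspace.size: 256m
{code}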


> Task Manager Metaspace Memory Leak 
> -----------------------------------
>
>                 Key: FLINK-11205
>                 URL: https://issues.apache.org/jira/browse/FLINK-11205
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.5.5, 1.6.2, 1.7.0
>            Reporter: Nawaid Shamim
>            Priority: Critical
>         Attachments: Screenshot 2018-12-18 at 12.14.11.png, Screenshot 2018-12-18 at 15.47.55.png
>
>
> Job restarts cause the task manager to dynamically load duplicate classes. Metaspace is unbounded and grows with every restart. YARN aggressively kills such containers, but this effect is immediately seen on a different task manager, which results in a death spiral.
> The Task Manager uses the dynamic class loader as described in [https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html]
> {quote}
> *YARN*
> YARN classloading differs between single job deployments and sessions:
>  * When submitting a Flink job/application directly to YARN (via {{bin/flink run -m yarn-cluster ...}}), dedicated TaskManagers and JobManagers are started for that job. Those JVMs have both Flink framework classes and user code classes in the Java classpath. That means that there is _no dynamic classloading_ involved in that case.
>  * When starting a YARN session, the JobManagers and TaskManagers are started with the Flink framework classes in the classpath. The classes from all jobs that are submitted against the session are loaded dynamically.
> {quote}
> The above is not entirely true, especially when you set {{-yD classloader.resolve-order=parent-first}}. We also observed the above behaviour when submitting a Flink job/application directly to YARN (via {{bin/flink run -m yarn-cluster ...}}).
> !Screenshot 2018-12-18 at 12.14.11.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)