Posted to issues@flink.apache.org by "Kerem Ulutaş (Jira)" <ji...@apache.org> on 2022/01/01 11:31:00 UTC

[jira] [Commented] (FLINK-25023) ClassLoader leak on JM/TM through indirectly-started Hadoop threads out of user code

    [ https://issues.apache.org/jira/browse/FLINK-25023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467404#comment-17467404 ] 

Kerem Ulutaş commented on FLINK-25023:
--------------------------------------

Hey all, thanks for your quick responses. I wish you a happy and healthy new year :)

[~lirui] - I can confirm that the {{StatisticsDataReferenceCleaner}} thread is started exactly once and never again. After reading your comment, I may have jumped to the conclusion that even that single instance wouldn't leak; sorry about that.

[~dmvk] - I think I forgot to mention that I am running my setup in a minikube environment and submitting my batch job to my Flink standalone cluster with the {{flink run}} command (not sure whether that counts as a web submission).

I checked out the PR and built a Flink image from it. I ran the batch job once, and the {{StatisticsDataReferenceCleaner}} thread still stays around after the job has finished, as before. I then took a heap dump and analyzed the Leak Suspects report in Eclipse MAT: {{ChildFirstClassLoader}} no longer appears in that report. The PR does what it is intended to do.
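
For anyone who wants to repeat the thread check without a full thread dump, here is a minimal sketch. It assumes the code can run inside the JVM being inspected (e.g. from a test or a debugging hook); {{ThreadProbe}} and {{liveThreadsNamedLike}} are hypothetical names for illustration, not part of Flink or Hadoop. It only lists live threads by name and says nothing about what those threads reference.

```java
import java.util.ArrayList;
import java.util.List;

class ThreadProbe {

    /** Returns all live threads in this JVM whose name contains the given fragment. */
    static List<Thread> liveThreadsNamedLike(String fragment) {
        List<Thread> matches = new ArrayList<>();
        // Thread.getAllStackTraces() returns a map keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().contains(fragment)) {
                matches.add(t);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Name fragment taken from the thread mentioned in this issue.
        for (Thread t : liveThreadsNamedLike("StatisticsDataReferenceCleaner")) {
            System.out.println(t.getName()
                    + " daemon=" + t.isDaemon()
                    + " contextClassLoader=" + t.getContextClassLoader());
        }
    }
}
```

The heap dump remains the more reliable check, since the leak described in this issue goes through an inherited {{ProtectionDomain}} rather than the context class loader printed here.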

Since the leak is not caused by the user directly, I think it is fair enough to apply the fix, no matter how complex it is.

One more thing I wonder about: although we no longer leak the class loader, the {{StatisticsDataReferenceCleaner}} thread is still present in the TaskManager thread dump, and metaspace usage does not decrease after the job finishes. A Flink standalone cluster with Hadoop libraries on the classpath can't really be used this way, as it is bound to throw metaspace OOM errors at some point between job submissions (and hopefully get restarted by minikube in my situation, but this doesn't feel comfortable). What needs to be done for metaspace usage to decrease after a job finishes?
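
To make the metaspace observation reproducible, a minimal sketch of reading the numbers from inside the JVM via the standard {{java.lang.management}} API follows. The pool name "Metaspace" is an assumption that holds on HotSpot JVMs; other JVMs may name the pool differently, in which case the helper returns -1. {{MetaspaceProbe}} is a hypothetical name, not a Flink class.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class MetaspaceProbe {

    /** Returns the used bytes of the memory pool named "Metaspace", or -1 if no such pool is found. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot exposes class metadata under this pool name.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("metaspace used bytes: " + metaspaceUsedBytes());
        System.out.println("classes currently loaded: " + classes.getLoadedClassCount());
        System.out.println("classes unloaded so far: " + classes.getUnloadedClassCount());
    }
}
```

Comparing these values before a job, after it finishes, and again after a full GC would show whether any classes are being unloaded at all; if the unloaded count never moves, something is presumably still keeping the job's classes reachable.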

> ClassLoader leak on JM/TM through indirectly-started Hadoop threads out of user code
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-25023
>                 URL: https://issues.apache.org/jira/browse/FLINK-25023
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / FileSystem, Connectors / Hadoop Compatibility, FileSystems
>    Affects Versions: 1.14.0, 1.12.5, 1.13.3
>            Reporter: Nico Kruber
>            Assignee: David Morávek
>            Priority: Major
>              Labels: pull-request-available
>
> If a Flink job is using HDFS through Flink's filesystem abstraction (either on the JM or TM), that code may actually spawn a few threads, e.g. from static class members:
>  * {{org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner}}
>  * {{IPC Parameter Sending Thread#*}}
> These threads are started as soon as the classes are loaded, which may be in the context of the user code. In this specific scenario, however, the created threads may contain references to the context class loader (I did not see that though) or, as happened here, they may inherit thread contexts such as the {{ProtectionDomain}} (from an {{AccessController}}).
> Hence user contexts and user class loaders are leaked into long-running threads that are run in Flink's (parent) classloader.
> Fortunately, it seems to only *leak a single* {{ChildFirstClassLoader}} in this concrete example but that may depend on which code paths each client execution is walking.
>  
> A *proper solution* doesn't seem so simple:
>  * We could try to proactively initialize available file systems in the hope of starting all threads in the parent classloader with the parent context.
>  * We could create a default {{ProtectionDomain}} for spawned threads as discussed at [https://dzone.com/articles/javalangoutofmemory-permgen]. However, the {{StatisticsDataReferenceCleaner}} isn't actively spawned from any callback but from a static variable, and thus during class loading itself (but maybe this is still possible somehow).
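
The first idea in the list above can be sketched roughly as follows. This is an illustration only, not the fix from the PR: {{ParentContextInit}} and its methods are hypothetical names, and a plain {{URLClassLoader}} stands in for Flink's {{ChildFirstClassLoader}}. It relies on the standard behavior that a new thread inherits the context class loader of the thread that creates it. It does not address the inherited {{ProtectionDomain}} described in the issue, which is a separate mechanism.

```java
import java.net.URL;
import java.net.URLClassLoader;

class ParentContextInit {

    /** Runs the action with the given loader as context class loader, then restores the previous one. */
    static void runWithContextClassLoader(ClassLoader loader, Runnable action) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(loader);
        try {
            action.run();
        } finally {
            // Always restore, even if the action throws.
            current.setContextClassLoader(previous);
        }
    }

    /** Creates (but does not start) a thread and returns the context class loader it inherited. */
    static ClassLoader inheritedContextLoader() {
        return new Thread(() -> { }).getContextClassLoader();
    }

    public static void main(String[] args) {
        ClassLoader parent = ParentContextInit.class.getClassLoader();
        // Stand-in for a user-code class loader; assumption for this demo only.
        ClassLoader userLoader = new URLClassLoader(new URL[0], parent);
        Thread.currentThread().setContextClassLoader(userLoader);

        // Without the wrapper, a new thread picks up the user-code loader.
        System.out.println("inherits user loader: " + (inheritedContextLoader() == userLoader));

        // With the wrapper, threads created inside the action see the parent loader.
        runWithContextClassLoader(parent, () ->
                System.out.println("inherits parent loader: " + (inheritedContextLoader() == parent)));
    }
}
```

In a real setup the action would be whatever triggers loading of the file system classes, so that any threads started from their static initializers are created while the parent context is active.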



--
This message was sent by Atlassian Jira
(v8.20.1#820001)