Posted to user@flink.apache.org by David Clutter <dc...@yahooinc.com> on 2022/01/04 15:38:53 UTC

Metaspace OOM : class loaders not being GC

I am seeing an issue where class loaders are not garbage collected and the
metaspace eventually runs out of memory (OOM).  Here is my setup:

- Flink 1.13.1 on EMR using JDK 8 in session mode
- Job manager is a long-running yarn session
- New jobs are submitted every 5m (and typically run for less than 5m)

I find that after a few hours the job manager gets killed with Metaspace
OOM.  I tried increasing the Metaspace for the job manager but that only
delays the OOM.

I did some debugging using jcmd and I noticed that the size of the classes
loaded is always increasing.  Next I did a heap dump and found that
instances of org.apache.flink.util.ChildFirstClassLoader are present long
after the jobs complete.  Checking the GC roots I found that there is a
reference in java.io.ObjectStreamClass$Caches.  Seems to be this JDK issue:
https://bugs.openjdk.java.net/browse/JDK-8277072
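
The same trend can also be watched from inside a JVM rather than with jcmd.
A minimal sketch using the standard java.lang.management beans follows. The
class name ClassLoadingProbe and the output format are illustrative
assumptions, and "Metaspace" is the pool name HotSpot reports on JDK 8 and
later; other JVMs may name it differently.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class ClassLoadingProbe {

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Classes unloaded since JVM start; stays flat if class loaders are never collected. */
    static long unloadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getUnloadedClassCount();
    }

    /** Bytes used by the pool named "Metaspace" (HotSpot naming, an assumption), or -1 if absent. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.printf("loaded=%d unloaded=%d metaspaceUsedBytes=%d%n",
                loadedClassCount(), unloadedClassCount(), metaspaceUsedBytes());
    }
}
```

This only reports on the JVM it runs in, so for a job manager it would have
to run inside that process or read the same beans over JMX. A loaded count
that keeps climbing while the unloaded count stays flat matches the symptom
described above.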

Curious if there are any workarounds for this situation?

Re: Metaspace OOM : class loaders not being GC

Posted by David Clutter <dc...@yahooinc.com>.
Thanks for the responses.  I switched to per-job mode and, as expected, it
is working well.  I suspected there wouldn't be an easy solution, but I had
to ask.  Thanks!

Re: Metaspace OOM : class loaders not being GC

Posted by David Morávek <da...@gmail.com> (sent Fri, Jan 7, 2022 at 3:37 AM).
Hi David,

If I understand the problem correctly, there is really nothing we can do
here. Soft references are cleared when there is high memory pressure on the
heap and the garbage collector needs to free up more memory. The problem is
that the GC doesn't take memory pressure on Metaspace into account when
making that decision.

I guess you might try to tweak _SoftRefLRUPolicyMSPerMB_ [1], but this
might have some other consequences. Also this behavior might be highly
dependent on the garbage collector you're using.


From the docs [1]:

-XX:SoftRefLRUPolicyMSPerMB=*time*

Sets the amount of time (in milliseconds) a softly reachable object is kept
active on the heap after the last time it was referenced. The default value
is one second of lifetime per free megabyte in the heap. The
-XX:SoftRefLRUPolicyMSPerMB option accepts integer values representing
milliseconds per one megabyte of the current heap size (for Java HotSpot
Client VM) or the maximum possible heap size (for Java HotSpot Server VM).
This difference means that the Client VM tends to flush soft references
rather than grow the heap, whereas the Server VM tends to grow the heap
rather than flush soft references. In the latter case, the value of the -Xmx
option has a significant effect on how quickly soft references are garbage
collected.

The following example shows how to set the value to 2.5 seconds:

-XX:SoftRefLRUPolicyMSPerMB=2500
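
To get a feel for the numbers, here is a rough arithmetic sketch of the
policy quoted above: retention time is approximately the free heap in MB
multiplied by the flag value. The class and method names are made up for
illustration, and the real HotSpot behavior depends on the collector and on
when GC cycles actually run, so treat the output as a ballpark only.

```java
public class SoftRefRetention {

    /**
     * Approximate time in milliseconds that a softly reachable object survives
     * after its last access: free heap (MB) times SoftRefLRUPolicyMSPerMB.
     */
    static long retentionMillis(long freeHeapMb, long msPerMb) {
        return freeHeapMb * msPerMb;
    }

    /** Free heap in MB measured against the maximum heap (-Xmx), as described for the Server VM. */
    static long freeHeapMbAgainstMax() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (rt.maxMemory() - used) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long freeMb = freeHeapMbAgainstMax();
        System.out.printf("free heap: ~%d MB%n", freeMb);
        System.out.printf("default (1000 ms/MB): ~%d s%n", retentionMillis(freeMb, 1000) / 1000);
        System.out.printf("with 2500 ms/MB:      ~%d s%n", retentionMillis(freeMb, 2500) / 1000);
    }
}
```

With the default of 1000 ms per MB and about 1 GB of free heap, this gives
roughly 1024 seconds (about 17 minutes) per soft reference, which is longer
than the 5 minute job interval in this thread. A lower value shortens that
window, and a value of 0 should make soft references eligible for clearing
at the next GC, at the cost of less effective soft-reference caches. In
Flink the flag would presumably be passed to the job manager through
env.java.opts.jobmanager.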



[1] https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html

Best,
D.

Re: Metaspace OOM : class loaders not being GC

Posted by Caizhi Weng <ts...@gmail.com> (sent Thu, Jan 6, 2022 at 3:13 AM).
Hi!

As far as I remember, this has been a known issue for a few years, but Flink
currently has no solution for it (correct me if I'm wrong). I see that
you're running jobs in a YARN session. Could you switch to yarn-per-job
mode (where the JM and TMs are created and destroyed for each job) as a
workaround?
