Posted to user@flink.apache.org by David Clutter <dc...@yahooinc.com> on 2022/01/07 18:19:00 UTC

Re: [E] Re: Metaspace OOM : class loaders not being GC

Thanks for the responses.  I did switch to per-job mode and it is working
well, of course.  I suspected there wouldn't be an easy solution, but I had
to ask.  Thanks!

On Fri, Jan 7, 2022 at 3:37 AM David Morávek <da...@gmail.com>
wrote:

> Hi David,
>
> If I understand the problem correctly, there is really nothing we can do
> here. Soft references are only garbage collected when there is high memory
> pressure on the heap and the garbage collector needs to free up more
> memory. The problem is that the GC doesn't really take high memory
> pressure on Metaspace into account.
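>
> To illustrate (a minimal, plain-Java sketch, not Flink code): a soft
> reference survives ordinary collections and is only cleared under heap
> pressure, which is why a cache built on soft references can keep a
> ClassLoader alive long after the job that loaded it has finished.
>
>   import java.lang.ref.SoftReference;
>
>   public class SoftRefSketch {
>       public static void main(String[] args) {
>           Object value = new Object();
>           SoftReference<Object> ref = new SoftReference<>(value);
>           value = null;  // drop the strong reference
>           System.gc();   // with plenty of free heap, the soft ref usually survives
>           System.out.println(ref.get() != null
>                   ? "soft reference still holds the object (no heap pressure)"
>                   : "soft reference was cleared");
>       }
>   }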
>
> I guess you might try to tweak _SoftRefLRUPolicyMSPerMB_ [1], but this
> might have other consequences. Also, this behavior might be highly
> dependent on the garbage collector you're using.
>
>
> From the docs [1]:
>
> -XX:SoftRefLRUPolicyMSPerMB=*time*
>
> Sets the amount of time (in milliseconds) a softly reachable object is
> kept active on the heap after the last time it was referenced. The default
> value is one second of lifetime per free megabyte in the heap. The
> -XX:SoftRefLRUPolicyMSPerMB option accepts integer values representing
> milliseconds per one megabyte of the current heap size (for Java HotSpot
> Client VM) or the maximum possible heap size (for Java HotSpot Server VM).
> This difference means that the Client VM tends to flush soft references
> rather than grow the heap, whereas the Server VM tends to grow the heap
> rather than flush soft references. In the latter case, the value of the
> -Xmx option has a significant effect on how quickly soft references are
> garbage collected.
>
> The following example shows how to set the value to 2.5 seconds:
>
> -XX:SoftRefLRUPolicyMSPerMB=2500
>
>
>
> [1] https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html
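>
> If you want to experiment with it, one way to pass the flag to the
> JobManager JVM is flink-conf.yaml's env.java.opts.jobmanager option (a
> sketch; the value below just mirrors the docs example above and is not a
> recommendation):
>
>   env.java.opts.jobmanager: -XX:SoftRefLRUPolicyMSPerMB=2500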
>
> Best,
> D.
>
> On Thu, Jan 6, 2022 at 3:13 AM Caizhi Weng <ts...@gmail.com> wrote:
>
>> Hi!
>>
>> As far as I remember, this has been a known issue for a few years, but
>> Flink currently has no solution for it (correct me if I'm wrong). I see
>> that you're running jobs on a YARN session. Could you switch to
>> yarn-per-job mode (where the JM and TMs are created and destroyed for each
>> job) as a workaround?
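>>
>> A per-job submission would look roughly like this (a sketch; the jar path
>> is a placeholder, and -d just detaches the client):
>>
>>   ./bin/flink run -t yarn-per-job -d /path/to/your-job.jar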
>>
>> On Tue, Jan 4, 2022 at 11:39 PM David Clutter <dc...@yahooinc.com> wrote:
>>
>>> I am seeing an issue with class loaders not being GCed and Metaspace
>>> eventually running out of memory.  Here is my setup:
>>>
>>> - Flink 1.13.1 on EMR using JDK 8 in session mode
>>> - Job manager is a long-running yarn session
>>> - New jobs are submitted every 5m (and typically run for less than 5m)
>>>
>>> I find that after a few hours the job manager gets killed with Metaspace
>>> OOM.  I tried increasing the Metaspace for the job manager but that only
>>> delays the OOM.
>>>
>>> I did some debugging using jcmd and noticed that the total size of loaded
>>> classes is always increasing.  Next I took a heap dump and found that
>>> instances of org.apache.flink.util.ChildFirstClassLoader are still present
>>> long after the jobs complete.  Checking the GC roots, I found that there is
>>> a reference in java.io.ObjectStreamClass$Caches.  It seems to match this
>>> JDK issue: https://bugs.openjdk.java.net/browse/JDK-8277072
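>>>
>>> For reference, the debugging was roughly along these lines (the PID and
>>> dump path are placeholders):
>>>
>>>   jcmd <jm-pid> GC.class_histogram          # live instances per class
>>>   jcmd <jm-pid> GC.heap_dump /tmp/jm.hprof  # hprof dump to check GC roots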
>>>
>>> Curious if there are any workarounds for this situation?
>>>
>>>