Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2021/10/18 13:24:00 UTC

[jira] [Commented] (FLINK-24401) TM cannot exit after Metaspace OOM

    [ https://issues.apache.org/jira/browse/FLINK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430011#comment-17430011 ] 

Piotr Nowojski commented on FLINK-24401:
----------------------------------------

[~fanrui], can you post the actual stack trace of the second OOM?

As the error was caused by {{org.apache.flink.runtime.taskexecutor.TaskManagerRunner.Result}}, I wonder if we should actually just treat all OOMs the same way, regardless of whether it's a Metaspace OOM or not. [~trohrmann], what do you think? I'm asking because you reviewed that change and I don't know the motivation behind it. Was it just a best-effort thing that we added because we thought we could?
{quote}
In case of Metaspace OOM error, we try a graceful TM shutdown to notify JM because it is not expected to require class loading for that and cause further failures.
{quote}
After all, this assumption is clearly wrong.
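
To make the two options concrete, here is a minimal sketch and not the actual {{TaskManagerRunner}} code. The class name {{FatalErrorSketch}}, the exit code, and the injected {{gracefulShutdown}} and {{halt}} callbacks are all hypothetical and exist only for illustration. It halts on every {{OutOfMemoryError}}, Metaspace or not, and also falls back to halting when the graceful path itself throws (for example with a {{NoClassDefFoundError}}), which is the catch-Throwable idea from the issue description.

```java
import java.util.function.IntConsumer;

/** Illustrative sketch only; every name in this class is hypothetical. */
public class FatalErrorSketch {

    /** Hypothetical exit code used when the process is halted. */
    public static final int FAILURE_EXIT_CODE = 1;

    private final Runnable gracefulShutdown; // stands in for the graceful shutdown path
    private final IntConsumer halt;          // stands in for a hard JVM termination

    public FatalErrorSketch(Runnable gracefulShutdown, IntConsumer halt) {
        this.gracefulShutdown = gracefulShutdown;
        this.halt = halt;
    }

    public void onFatalError(Throwable error) {
        // Option 1: treat all OOMs the same way and never attempt a graceful shutdown.
        if (error instanceof OutOfMemoryError) {
            halt.accept(FAILURE_EXIT_CODE);
            return;
        }
        try {
            gracefulShutdown.run();
        } catch (Throwable t) {
            // Option 2: if the graceful path fails for any reason
            // (e.g. a class cannot be initialized), halt instead of retrying.
            halt.accept(FAILURE_EXIT_CODE);
        }
    }

    public static void main(String[] args) {
        FatalErrorSketch handler = new FatalErrorSketch(
                () -> System.out.println("graceful shutdown"),
                code -> Runtime.getRuntime().halt(code));
        // A non-OOM failure takes the graceful path and the JVM is not halted.
        handler.onFatalError(new IllegalStateException("non-OOM failure"));
    }
}
```

Injecting the halt callback keeps the sketch testable; in a real process it would be something like {{Runtime.getRuntime().halt(...)}}, which skips shutdown hooks.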

> TM cannot exit after Metaspace OOM
> ----------------------------------
>
>                 Key: FLINK-24401
>                 URL: https://issues.apache.org/jira/browse/FLINK-24401
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.12.0, 1.13.0
>            Reporter: future
>            Priority: Major
>             Fix For: 1.14.1, 1.13.4
>
>         Attachments: image-2021-09-29-12-00-28-510.png, image-2021-09-29-12-00-44-812.png
>
>
> Hi masters, from the code and log, we can see that an ordinary OOM calls terminateJVM directly, whereas a Metaspace OutOfMemoryError triggers a graceful shutdown. The code comment mentions: {{_it does not usually require more class loading to fail again with the Metaspace OutOfMemoryError_}}.
> But we encountered the following: after a Metaspace OutOfMemoryError, {{_java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.taskexecutor.TaskManagerRunner$Result_}} prevents the TM from exiting. It keeps retrying and keeps hitting NoClassDefFoundError because class loading keeps failing, until the TM is killed manually.
> I want to add a catch for Throwable in the onFatalError method and call terminateJVM() directly in the catch block. Is there any problem with this strategy? 
>  
> [code link |https://github.com/apache/flink/blob/4fe9f525a92319acc1e3434bebed601306f7a16f/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L312]
> picture:
>  
> !image-2021-09-29-12-00-44-812.png|width=1337,height=692!
>   !image-2021-09-29-12-00-28-510.png!
>  
>  


