Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2021/10/19 07:03:00 UTC
[jira] [Comment Edited] (FLINK-24401) TM cannot exit after Metaspace OOM

    [ https://issues.apache.org/jira/browse/FLINK-24401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430011#comment-17430011 ] 

Piotr Nowojski edited comment on FLINK-24401 at 10/19/21, 7:02 AM:
-------------------------------------------------------------------

[~fanrui], can you post the stack trace of the actual second OOM?

As the error was caused by {{org.apache.flink.runtime.taskexecutor.TaskManagerRunner.Result}}, I wonder if we should actually just treat all OOMs the same way, regardless of whether it's Metaspace or not. [~trohrmann], what do you think? I'm asking because you reviewed that change, and I don't know the motivation behind it. Was it just a best-effort thing that we added because we thought we could? Or was it addressing some problem that users were complaining about?
{quote}
In case of Metaspace OOM error, we try a graceful TM shutdown to notify JM because it is not expected to require class loading for that and cause further failures.
{quote}
After all, this assumption is clearly wrong.
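
To make the discussion concrete, here is a minimal sketch of the fallback proposed in the issue: attempt the graceful shutdown for a Metaspace OOM, but hard-exit if that attempt itself throws. This is not the actual TaskManagerRunner code. The class FatalErrorSketch, the interfaces JvmTerminator and GracefulShutdown, and the exit code are hypothetical names chosen for illustration, and detecting a Metaspace OOM by its message is a simplification. The sketch also assumes the failure surfaces synchronously inside onFatalError; a failure in an asynchronous callback would not be caught this way.

```java
import java.util.ArrayList;
import java.util.List;

final class FatalErrorSketch {

    /** Hypothetical stand-in for a hard JVM exit such as Runtime.getRuntime().halt(code). */
    interface JvmTerminator {
        void terminate(int exitCode);
    }

    /** Hypothetical stand-in for the graceful shutdown that notifies the JM. */
    interface GracefulShutdown {
        void shutDown(Throwable cause) throws Exception;
    }

    static final int FAILURE_EXIT_CODE = 1; // assumed value, for illustration only

    private final JvmTerminator terminator;
    private final GracefulShutdown gracefulShutdown;

    FatalErrorSketch(JvmTerminator terminator, GracefulShutdown gracefulShutdown) {
        this.terminator = terminator;
        this.gracefulShutdown = gracefulShutdown;
    }

    /** Simplified check: the JVM reports this error with the message "Metaspace". */
    static boolean isMetaspaceOom(Throwable t) {
        return t instanceof OutOfMemoryError
                && t.getMessage() != null
                && t.getMessage().contains("Metaspace");
    }

    void onFatalError(Throwable error) {
        try {
            if (error instanceof OutOfMemoryError && !isMetaspaceOom(error)) {
                // Non-Metaspace OOM: exit immediately.
                terminator.terminate(FAILURE_EXIT_CODE);
            } else {
                // Metaspace OOM (or another error): try to shut down gracefully.
                gracefulShutdown.shutDown(error);
            }
        } catch (Throwable t) {
            // The graceful path failed (for example with NoClassDefFoundError),
            // so fall back to the hard exit instead of retrying.
            terminator.terminate(FAILURE_EXIT_CODE);
        }
    }

    public static void main(String[] args) {
        List<Integer> exitCodes = new ArrayList<>();
        FatalErrorSketch sketch = new FatalErrorSketch(
                code -> exitCodes.add(code), // record instead of really exiting
                cause -> {
                    throw new NoClassDefFoundError("simulated class loading failure");
                });
        sketch.onFatalError(new OutOfMemoryError("Metaspace"));
        System.out.println("Recorded exit codes: " + exitCodes);
    }
}
```

Injecting the terminator keeps the sketch testable without killing the JVM. If all OOMs were treated the same way, as suggested above, the branch on isMetaspaceOom would simply go away.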


was (Author: pnowojski):
[~fanrui], can you post the stack trace of the actual second OOM?

As the error was caused by {{org.apache.flink.runtime.taskexecutor.TaskManagerRunner.Result}}, I wonder if we should actually just treat all OOMs the same way, regardless of whether it's Metaspace or not. [~trohrmann], what do you think? I'm asking because you reviewed that change, and I don't know the motivation behind it. Was it just a best-effort thing that we added because we thought we could?
{quote}
In case of Metaspace OOM error, we try a graceful TM shutdown to notify JM because it is not expected to require class loading for that and cause further failures.
{quote}
After all, this assumption is clearly wrong.

> TM cannot exit after Metaspace OOM
> ----------------------------------
>
>                 Key: FLINK-24401
>                 URL: https://issues.apache.org/jira/browse/FLINK-24401
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.12.0, 1.13.0
>            Reporter: future
>            Priority: Major
>             Fix For: 1.14.1, 1.13.4
>
>         Attachments: image-2021-09-29-12-00-28-510.png, image-2021-09-29-12-00-44-812.png
>
>
> Hi masters, from the code and the logs we can see that a regular OOM terminates the JVM directly, whereas a Metaspace OutOfMemoryError triggers a graceful shutdown. The code comment mentions: {{_it does not usually require more class loading to fail again with the Metaspace OutOfMemoryError_}}.
> But we encountered the following: after a Metaspace OutOfMemoryError, {{_java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.taskexecutor.TaskManagerRunner$Result_}} is thrown, which leaves the TM unable to exit. It keeps retrying and keeps failing with NoClassDefFoundError during class loading until the TM is killed manually.
> I want to add a catch Throwable in the onFatalError method and call terminateJVM() directly in the catch block. Is there any problem with this strategy?
>  
> [code link |https://github.com/apache/flink/blob/4fe9f525a92319acc1e3434bebed601306f7a16f/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerRunner.java#L312]
> picture:
>  
> !image-2021-09-29-12-00-44-812.png|width=1337,height=692!
>   !image-2021-09-29-12-00-28-510.png!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)