
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/07/02 09:04:00 UTC

[jira] [Commented] (FLINK-12889) Job keeps in FAILING state

    [ https://issues.apache.org/jira/browse/FLINK-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876799#comment-16876799 ] 

Till Rohrmann commented on FLINK-12889:
---------------------------------------

Thanks for reporting this issue, [~xinpu], and for the analysis, [~zjwang]. I think your analysis is correct. The problem is that we don't catch the last OOM, which is thrown when creating the task canceller thread. This should indeed kill the TM process, because it is no longer guaranteed that this Flink process works correctly.

I would suggest creating the {{StreamTask#asyncOperationsThreadPool}} with a {{FatalExitExceptionHandler}} as its uncaught exception handler. This should cause the TM to exit in the situation you've described, [~xinpu].
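
To illustrate the idea, here is a minimal, self-contained sketch in plain Java. It is not the actual Flink change: {{ExitOnUncaught}} is a hypothetical stand-in for {{FatalExitExceptionHandler}}, {{newPool}} is a hypothetical helper rather than the code that builds {{asyncOperationsThreadPool}}, and the exit code 1 is an arbitrary choice. The exit action is injected so that the demo can record the call instead of really killing the JVM, and the error is simulated inside a pool task rather than raised during thread creation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

public class FatalExitPoolSketch {

    /** Hypothetical stand-in for a fatal-exit handler: log, then run the exit action. */
    public static final class ExitOnUncaught implements Thread.UncaughtExceptionHandler {
        private final IntConsumer exitAction;

        public ExitOnUncaught(IntConsumer exitAction) {
            this.exitAction = exitAction;
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            try {
                System.err.println("FATAL: thread '" + t.getName() + "' died with " + e);
            } finally {
                // Run the exit action even if logging itself fails.
                // In a real process this could be System::exit or Runtime.getRuntime()::halt.
                exitAction.accept(1); // exit code 1 is an assumption of this sketch
            }
        }
    }

    /** Hypothetical helper: a cached pool whose worker threads all carry the handler. */
    public static ExecutorService newPool(String name, Thread.UncaughtExceptionHandler handler) {
        AtomicInteger counter = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, name + "-" + counter.incrementAndGet());
            t.setDaemon(true);
            t.setUncaughtExceptionHandler(handler);
            return t;
        };
        return Executors.newCachedThreadPool(factory);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch exitRequested = new CountDownLatch(1);
        AtomicInteger exitCode = new AtomicInteger();

        // Record the exit request instead of terminating the JVM, so the demo can report it.
        ExecutorService pool = newPool("async-ops", new ExitOnUncaught(code -> {
            exitCode.set(code);
            exitRequested.countDown();
        }));

        // Simulate an error escaping a task. Use execute(): with submit() the
        // throwable is captured in the returned Future and the handler is not invoked.
        pool.execute(() -> {
            throw new OutOfMemoryError("simulated");
        });

        boolean requested = exitRequested.await(5, TimeUnit.SECONDS);
        System.out.println("exit requested: " + requested + ", code: " + exitCode.get());
        pool.shutdownNow();
    }
}
```

One caveat with this approach: the handler only fires for throwables that actually escape the worker thread, so tasks handed to the pool via {{submit()}} would still need their futures checked.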

> Job keeps in FAILING state
> --------------------------
>
>                 Key: FLINK-12889
>                 URL: https://issues.apache.org/jira/browse/FLINK-12889
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.7.2, 1.8.1, 1.9.0
>            Reporter: Fan Xinpu
>            Priority: Critical
>             Fix For: 1.7.3, 1.8.2, 1.9.0
>
>         Attachments: 20190618104945417.jpg, jobmanager.log.2019-06-16.0, taskmanager.log
>
>
> The topology has 3 operators: source, parser, and persist. Occasionally, 5 subtasks of the source encounter an exception and turn to FAILED; at the same time, one subtask of the parser runs into an exception and turns to FAILED too. The jobmaster receives a message about the parser's failure and then tries to cancel all subtasks. Most subtasks of the three operators turn to CANCELED, except the 5 subtasks of the source, because their state was already FAILED before the jobmaster tried to cancel them. The jobmaster then cannot reach a final state and stays in the FAILING state, while the subtasks of the source stay in the CANCELING state.
>  
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10 slots.
>  
> The attached files contain a JM log, a TM log, and a screenshot of the UI.
>  
> The exception timestamp is about 2019-06-16 13:42:28.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)