Posted to issues@flink.apache.org by "zhijiang (JIRA)" <ji...@apache.org> on 2019/06/26 05:32:00 UTC

[jira] [Commented] (FLINK-12889) Job keeps in FAILING state

    [ https://issues.apache.org/jira/browse/FLINK-12889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872933#comment-16872933 ] 

zhijiang commented on FLINK-12889:
----------------------------------

My previous analysis on the mailing list (ML) was as follows:
 
Actually all five "Source: ServiceLog" tasks are not in a terminal state from the JM's view; the relevant process is shown below (a minimal sketch of this flow follows the list):
 * The checkpoint in the task causes an OOM issue, which calls `Task#failExternally` as a result; we can see the log "Attempting to fail task externally" in the TM.
 * The source task transitions from RUNNING to FAILED and then starts a canceler thread for canceling the task; we can see the log "Triggering cancellation of task" in the TM.
 * When the JM starts to cancel the source tasks, the RPC call `Task#cancelExecution` finds the task already in FAILED state as in step 2; we can see the log "Attempting to cancel task" in the TM.
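To make the three steps easier to follow, here is a minimal sketch of that flow in plain Java. This is not Flink's actual `Task` implementation; apart from the `failExternally`/`cancelExecution` names taken from the analysis above, the class, fields, and helper method are made up for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

enum ExecutionState { RUNNING, FAILED, CANCELING, CANCELED }

class TaskSketch {
    private final AtomicReference<ExecutionState> state =
            new AtomicReference<>(ExecutionState.RUNNING);

    // Step 1 + 2: invoked from the task when the checkpoint hits the OOM.
    void failExternally(Throwable cause) {
        if (state.compareAndSet(ExecutionState.RUNNING, ExecutionState.FAILED)) {
            // "Triggering cancellation of task": creating/starting this thread can
            // itself throw OutOfMemoryError ("unable to create new native thread").
            new Thread(this::cancelInvokableAndNotifyJobManager, "Task canceler").start();
        }
    }

    // Step 3: cancel RPC from the JM. The task is already FAILED, so the CAS fails
    // and no new canceler thread (and no final report to the JM) is produced.
    void cancelExecution() {
        if (state.compareAndSet(ExecutionState.RUNNING, ExecutionState.CANCELING)) {
            new Thread(this::cancelInvokableAndNotifyJobManager, "Task canceler").start();
        }
    }

    private void cancelInvokableAndNotifyJobManager() {
        // interrupt the task thread and report the terminal state to the JM ...
    }
}
```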

 
At last, all five source tasks are not in terminal states in the JM log. My guess is that step 2 might not have created the canceler thread successfully: the root failure was an OOM while creating a native thread in step 1, so it is quite possible that creating the canceler thread fails for the same reason, since the OOM condition is unstable. If so, the source task would never be interrupted and would never report to the JM, although its state had already changed to FAILED.
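To see why step 2 can fail silently, here is a small standalone demo (nothing Flink-specific; only run it on a throwaway machine or with a low thread limit such as `ulimit -u`). Under native-thread exhaustion the OutOfMemoryError is thrown from `Thread.start()` itself, so the would-be canceler thread simply never runs:

```java
public class NativeThreadOomDemo {
    public static void main(String[] args) {
        int created = 0;
        try {
            while (true) {
                Thread t = new Thread(() -> {
                    try {
                        Thread.sleep(Long.MAX_VALUE); // keep the thread alive
                    } catch (InterruptedException ignored) {
                    }
                });
                t.setDaemon(true);
                t.start(); // eventually throws OutOfMemoryError here
                created++;
            }
        } catch (OutOfMemoryError e) {
            System.err.println("Thread creation failed after " + created
                    + " threads: " + e.getMessage());
        }
    }
}
```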
 
For the other vertex tasks, `Task#failExternally` is not triggered in step 1; they only receive the cancel RPC from the JM in step 3. My guess is that by this later point the canceler thread could be created successfully, after some GC had freed resources, so these tasks could be canceled and reported to the JM side.
 
I think the key problem is that under an OOM condition some behaviors are not within expectations, which can bring problems like this one. Maybe we should handle OOM errors in an extreme way, like making the TM exit, to solve the potential issue.
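As one possible illustration of that fail-fast idea (a sketch only, not Flink's actual error handling; `isJvmFatalError` and `handleTaskFailure` are hypothetical names here), the TM could classify the failure cause and halt the process instead of relying on per-task cancellation. The JVM flag `-XX:+ExitOnOutOfMemoryError` achieves something similar at the VM level.

```java
public final class FatalErrorUtil {
    private FatalErrorUtil() {}

    /** Returns true for errors after which this JVM should not keep running. */
    public static boolean isJvmFatalError(Throwable t) {
        return t instanceof OutOfMemoryError
                || t instanceof InternalError
                || t instanceof UnknownError;
    }

    /** Hypothetical hook on the task failure path sketched earlier. */
    public static void handleTaskFailure(Throwable cause) {
        if (isJvmFatalError(cause)) {
            System.err.println("Fatal JVM error, terminating the TaskManager: " + cause);
            // halt() rather than exit(): shutdown hooks may need memory or
            // threads that are no longer available after the OOM.
            Runtime.getRuntime().halt(-17);
        }
        // otherwise fall through to the normal failExternally()/cancel path
    }
}
```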
 
[~till.rohrmann] do you think it is worth fixing, or do you have other concerns?

> Job keeps in FAILING state
> --------------------------
>
>                 Key: FLINK-12889
>                 URL: https://issues.apache.org/jira/browse/FLINK-12889
>             Project: Flink
>          Issue Type: Bug
>          Components: API / DataStream
>    Affects Versions: 1.7.2
>            Reporter: Fan Xinpu
>            Priority: Blocker
>         Attachments: 20190618104945417.jpg, jobmanager.log.2019-06-16.0, taskmanager.log
>
>
> There is a topology of 3 operators: source, parser, and persist. Occasionally, 5 subtasks of the source encounter exceptions and turn to FAILED; at the same time, one subtask of the parser runs into an exception and turns to FAILED too. The jobmaster gets a message about the parser's failure. The jobmaster then tries to cancel all the subtasks. Most of the subtasks of the three operators turn to CANCELED, except the 5 subtasks of the source, because their state is already FAILED before the jobmaster tries to cancel them. The jobmaster therefore cannot reach a final state but stays in FAILING state, while the subtasks of the source stay in CANCELING state.
>  
> The job runs on a Flink 1.7 cluster on YARN, and there is only one TM with 10 slots.
>  
> The attached files contain a jm log, a tm log, and the UI picture.
>  
> The exception timestamp is about 2019-06-16 13:42:28.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)