Posted to issues@flink.apache.org by "Gary Yao (JIRA)" <ji...@apache.org> on 2019/05/06 12:08:00 UTC

[jira] [Updated] (FLINK-12219) Yarn application can't stop when flink job failed in per-job yarn cluster mode

     [ https://issues.apache.org/jira/browse/FLINK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Yao updated FLINK-12219:
-----------------------------
    Summary: Yarn application can't stop when flink job failed in per-job yarn cluster mode  (was: Yarn application can't stop when flink job failed in per-job yarn cluste mode)

> Yarn application can't stop when flink job failed in per-job yarn cluster mode
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-12219
>                 URL: https://issues.apache.org/jira/browse/FLINK-12219
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Runtime / REST
>    Affects Versions: 1.6.3, 1.8.0
>            Reporter: lamber-ken
>            Assignee: lamber-ken
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: fix-bug.patch, image-2019-04-17-15-00-40-687.png, image-2019-04-17-15-02-49-513.png, image-2019-04-23-17-37-00-081.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> h3. *Issue detail info*
> In our Flink (1.6.3) production environment, I often see the YARN application fail to stop after a Flink job fails in per-job YARN cluster mode, so I analyzed in depth why this happens.
> When a Flink job fails, the system writes an archive file to a FileSystem through the +MiniDispatcher#archiveExecutionGraph+ method and then notifies YarnJobClusterEntrypoint to shut down. However, if +MiniDispatcher#archiveExecutionGraph+ throws an exception during execution, it prevents the calls that follow, including the shutdown notification, from running.
> So I opened [FLINK-12247|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-12247] to fix the NPE that occurs when the system writes the archive to the FileSystem. But we still need to consider other exceptions, so we should catch Exception / Throwable, not just IOException.
> h3. *Flink yarn job fail flow*
> !image-2019-04-23-17-37-00-081.png!
> h3. *Flink yarn job fail on yarn*
> !image-2019-04-17-15-00-40-687.png!
>  
> h3. *Flink yarn application can't stop*
> !image-2019-04-17-15-02-49-513.png!
>  
>  
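The failure mode described in the issue above can be illustrated with a minimal, self-contained Java sketch. It is not Flink's code: the `Archiver` interface, the `ArchiveShutdownSketch` class, its method names, and the use of a `CompletableFuture` as the shutdown signal are all hypothetical stand-ins chosen for illustration. The sketch only shows why handling `IOException` alone can leave the shutdown signal unsent when archiving throws a runtime exception, and how a broader catch avoids that.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for the archiving step; not Flink's actual interface. */
interface Archiver {
    void archive(String jobId) throws IOException;
}

final class ArchiveShutdownSketch {

    /** Fragile variant: only IOException is handled, so a runtime exception skips the shutdown signal. */
    static void onJobFailedFragile(String jobId, Archiver archiver, CompletableFuture<String> shutdown) {
        try {
            archiver.archive(jobId);
        } catch (IOException e) {
            System.err.println("Could not archive " + jobId + ": " + e);
        }
        // Not reached if archive() throws, for example, a NullPointerException.
        shutdown.complete("FAILED");
    }

    /** Defensive variant: any Throwable from archiving is logged and shutdown is still signalled. */
    static void onJobFailedDefensive(String jobId, Archiver archiver, CompletableFuture<String> shutdown) {
        try {
            archiver.archive(jobId);
        } catch (Throwable t) {
            System.err.println("Could not archive " + jobId + ": " + t);
        }
        shutdown.complete("FAILED");
    }

    public static void main(String[] args) {
        // Simulated archiving bug (assumption: stands in for the NPE mentioned in the issue).
        Archiver broken = jobId -> { throw new NullPointerException("simulated archive bug"); };

        CompletableFuture<String> fragile = new CompletableFuture<>();
        try {
            onJobFailedFragile("job-1", broken, fragile);
        } catch (RuntimeException e) {
            // The exception escaped, so the shutdown future was never completed.
        }
        System.out.println("fragile shutdown signalled: " + fragile.isDone());

        CompletableFuture<String> defensive = new CompletableFuture<>();
        onJobFailedDefensive("job-1", broken, defensive);
        System.out.println("defensive shutdown signalled: " + defensive.isDone());
    }
}
```

Running `main` reports that the fragile variant never signals shutdown while the defensive variant does. Catching `Throwable` also swallows JVM errors such as `OutOfMemoryError`, so whether to catch `Exception` or `Throwable` is a design choice; the issue text leaves both options open.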



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)