Posted to issues@flink.apache.org by "Gary Yao (JIRA)" <ji...@apache.org> on 2019/05/06 12:08:00 UTC
[jira] [Updated] (FLINK-12219) Yarn application can't stop when
flink job failed in per-job yarn cluster mode
[ https://issues.apache.org/jira/browse/FLINK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary Yao updated FLINK-12219:
-----------------------------
Summary: Yarn application can't stop when flink job failed in per-job yarn cluster mode (was: Yarn application can't stop when flink job failed in per-job yarn cluste mode)
> Yarn application can't stop when flink job failed in per-job yarn cluster mode
> ------------------------------------------------------------------------------
>
> Key: FLINK-12219
> URL: https://issues.apache.org/jira/browse/FLINK-12219
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN, Runtime / REST
> Affects Versions: 1.6.3, 1.8.0
> Reporter: lamber-ken
> Assignee: lamber-ken
> Priority: Major
> Labels: pull-request-available
> Attachments: fix-bug.patch, image-2019-04-17-15-00-40-687.png, image-2019-04-17-15-02-49-513.png, image-2019-04-23-17-37-00-081.png
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> h3. *Issue detail info*
> In our Flink (1.6.3) production environment, I often see a YARN application fail to stop when a Flink job fails in per-job YARN cluster mode, so I analyzed why this happens.
> When a Flink job fails, the system writes an archive file to a FileSystem through the +MiniDispatcher#archiveExecutionGraph+ method and then notifies YarnJobClusterEntrypoint to shut down. However, if +MiniDispatcher#archiveExecutionGraph+ throws an exception during execution, the exception disrupts the calls that follow, including the shutdown notification.
> So I opened [FLINK-12247|https://issues.apache.org/jira/projects/FLINK/issues/FLINK-12247] to fix the NPE that occurs when the system writes the archive to a FileSystem. But we still need to consider other exceptions, so we should catch Exception / Throwable, not just IOException.
> h3. *Flink yarn job fail flow*
> !image-2019-04-23-17-37-00-081.png!
> h3. *Flink yarn job fail on yarn*
> !image-2019-04-17-15-00-40-687.png!
>
> h3. *Flink yarn application can't stop*
> !image-2019-04-17-15-02-49-513.png!
>
>
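The failure path described under "Issue detail info" can be illustrated with a minimal, self-contained sketch. It is not Flink code: the class `ArchiveShutdownSketch`, the `Archiver` interface, and the methods `narrowCatch` and `broadCatch` are hypothetical names made up for this illustration, and the boolean return value merely stands in for "the shutdown notification was reached". It assumes the archive step and the notification run one after the other, as the report describes.

```java
import java.io.IOException;

// Hypothetical illustration only; none of these names exist in Flink.
class ArchiveShutdownSketch {

    /** Stand-in for the step that writes the archive file. */
    interface Archiver {
        void archive() throws Exception;
    }

    /** Catches only IOException; returns whether the shutdown step was reached. */
    static boolean narrowCatch(Archiver archiver) {
        boolean shutDownRequested = false;
        try {
            try {
                archiver.archive();
            } catch (IOException e) {
                System.err.println("Could not archive job: " + e);
            }
            // Stand-in for notifying the cluster entrypoint to shut down.
            shutDownRequested = true;
        } catch (Exception escaped) {
            // Any non-IOException skipped the notification above.
            System.err.println("Archiving failed unexpectedly: " + escaped);
        }
        return shutDownRequested;
    }

    /** Catches Throwable, so the shutdown step is always reached. */
    static boolean broadCatch(Archiver archiver) {
        try {
            archiver.archive();
        } catch (Throwable t) {
            System.err.println("Could not archive job: " + t);
        }
        // Stand-in for notifying the cluster entrypoint to shut down.
        return true;
    }

    public static void main(String[] args) {
        // Simulate an unchecked failure during archiving.
        Archiver failing = () -> {
            throw new NullPointerException("simulated archive failure");
        };
        System.out.println("narrow catch reached shutdown: " + narrowCatch(failing));
        System.out.println("broad catch reached shutdown: " + broadCatch(failing));
    }
}
```

With the simulated `NullPointerException`, the narrow variant never reaches the shutdown step, while the broad variant does. Catching `Throwable` also swallows serious errors, so logging the failure, as the sketch does, seems a sensible minimum.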
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)