Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/08/28 10:01:00 UTC
[jira] [Commented] (FLINK-14268) YARN AM endless restarts when
using wrong checkpoint path or wrong checkpoint
[ https://issues.apache.org/jira/browse/FLINK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186440#comment-17186440 ]
Till Rohrmann commented on FLINK-14268:
---------------------------------------
[~葛聂] have you checked whether the problem still occurs with the latest Flink version?
> YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint
> -----------------------------------------------------------------------------
>
> Key: FLINK-14268
> URL: https://issues.apache.org/jira/browse/FLINK-14268
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Affects Versions: 1.7.2
> Environment: Flink: 1.7.2
> Deployment: YARN Per Job
> YARN: 2.7.2
> State backend: FSStateBackend with HDFS
>
> Reporter: Lsw_aka_laplace
> Priority: Major
>
> I tried to start a streaming job and restore it from a checkpoint stored in HDFS.
> I set a wrong checkpoint path and something unexpected happened: the YARN AM restarted again and again. Since we had already configured a restart strategy to prevent endless restarts, it should have restarted only a limited number of times.
> After confirming that the restart strategy itself works, we dug into the source code and made a change, mainly in _ClusterEntrypoint_.
>
> {code:java}
> // before
> @Override
> public void onFatalError(Throwable exception) {
>     LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>     System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
>
> // after
> @Override
> public void onFatalError(Throwable exception) {
>     LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>     if (ExceptionUtils.findThrowable(exception, PerJobFatalException.class).isPresent()) {
>         // PerJobFatalException is the FLAG.
>         // In per-job mode, when certain fatal exceptions occur, the AM keeps restarting and never fails.
>         LOG.error("perjob fatal error");
>         System.exit(STARTUP_FAILURE_RETURN_CODE);
>     }
>     System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> {code}
> Under that condition we forced the exit code to be STARTUP_FAILURE_RETURN_CODE rather than RUNTIME_FAILURE_RETURN_CODE, and *it DID WORK*.
>
>
> After discussing with [~Tison], I learned that FAILURE_RETURN_CODE seems to be used only for debugging, so I submitted this issue and look forward to ANY solution.
>
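The exit-code selection described in the quoted report can be sketched without any Flink dependency, roughly as follows. This is only an illustration under stated assumptions: the class name ExitCodeSelector, the helper selectExitCode, the local findThrowable, and the numeric values of the two constants are hypothetical stand-ins, and PerJobFatalException is the reporter's own marker class, re-declared here so the example compiles by itself.

```java
import java.util.Optional;

final class ExitCodeSelector {
    // Assumed values for illustration only; check the constants in your Flink version.
    static final int STARTUP_FAILURE_RETURN_CODE = 1;
    static final int RUNTIME_FAILURE_RETURN_CODE = 2;

    /** Hypothetical stand-in for the reporter's marker exception. */
    static class PerJobFatalException extends RuntimeException {
        PerJobFatalException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Walks the cause chain and returns the first throwable of the given type, if any. */
    static <T extends Throwable> Optional<T> findThrowable(Throwable root, Class<T> type) {
        for (Throwable current = root; current != null; current = current.getCause()) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
        }
        return Optional.empty();
    }

    /** Picks the startup code when the marker exception appears anywhere in the chain. */
    static int selectExitCode(Throwable exception) {
        return findThrowable(exception, PerJobFatalException.class).isPresent()
                ? STARTUP_FAILURE_RETURN_CODE
                : RUNTIME_FAILURE_RETURN_CODE;
    }

    public static void main(String[] args) {
        Throwable badCheckpoint = new RuntimeException(
                "job start failed",
                new PerJobFatalException("checkpoint path not found", null));
        Throwable other = new IllegalStateException("unrelated failure");

        System.out.println("marker present -> exit code " + selectExitCode(badCheckpoint));
        System.out.println("marker absent  -> exit code " + selectExitCode(other));
    }
}
```

Separating the decision (selectExitCode) from the call to System.exit keeps the logic testable. Whether a different exit code actually stops the AM restarts is the reporter's observation on Flink 1.7.2 and is not verified by this sketch.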
--
This message was sent by Atlassian Jira
(v8.3.4#803005)