Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/08/28 10:01:00 UTC

[jira] [Commented] (FLINK-14268) YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint

    [ https://issues.apache.org/jira/browse/FLINK-14268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186440#comment-17186440 ] 

Till Rohrmann commented on FLINK-14268:
---------------------------------------

[~葛聂] have you checked whether the problem still occurs with the latest Flink version?

> YARN AM endless restarts when using wrong checkpoint path or wrong checkpoint
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-14268
>                 URL: https://issues.apache.org/jira/browse/FLINK-14268
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.7.2
>         Environment: Flink: 1.7.2
> Deployment: YARN Per Job
> YARN: 2.7.2
> State backend: FSStateBackend with HDFS
>  
>            Reporter: Lsw_aka_laplace
>            Priority: Major
>
> I tried to start a streaming job and restore it from a checkpoint stored in HDFS.
> I set a wrong checkpoint path and something unexpected happened: the YARN AM restarted again and again. Since we had already configured a restart strategy to prevent endless restarts, it should have restarted only a limited number of times.
> After confirming that the restart strategy itself works, we dug into the source code and made a change, mainly in _ClusterEntrypoint_.
>  
> {code:java}
> // code placeholder
> //before 
> @Override
> public void onFatalError(Throwable exception) {
>    LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>    System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> //after 
> @Override
> public void onFatalError(Throwable exception) {
>    LOG.error("Fatal error occurred in the cluster entrypoint.", exception);
>    if (ExceptionUtils.findThrowable(exception, PerJobFatalException.class).isPresent()) {
>       // PerJobFatalException is the flag.
>       // In per-job mode, some fatal exceptions make the AM restart forever instead of failing.
>       LOG.error("perjob fatal error");
>       System.exit(STARTUP_FAILURE_RETURN_CODE);
>    }
>    System.exit(RUNTIME_FAILURE_RETURN_CODE);
> }
> {code}
>  We forced the exit code to be STARTUP_FAILURE_RETURN_CODE rather than RUNTIME_FAILURE_RETURN_CODE under this condition, and *it DID WORK*.
>  
>  
> After discussing with [~Tison], I learned that FAILURE_RETURN_CODE seems to be used only for debugging, so I filed this issue and look forward to ANY solution~
>  
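For readers following the patch quoted above, here is a minimal standalone sketch of the exit-code selection it describes. It is not Flink code: the class name ExitCodeSketch, the local findThrowable helper (a stand-in for the ExceptionUtils.findThrowable call in the patch), the PerJobFatalException stand-in, and the numeric values of the two constants are all assumptions made so the example runs on its own. It only shows how a marker exception anywhere in the cause chain switches the chosen return code. How YARN reacts to that code is outside the sketch.

```java
import java.util.Optional;

public class ExitCodeSketch {
    // Assumed values for illustration; only the names come from the issue.
    static final int STARTUP_FAILURE_RETURN_CODE = 1;
    static final int RUNTIME_FAILURE_RETURN_CODE = 2;

    /** Hypothetical stand-in for the reporter's PerJobFatalException marker. */
    static class PerJobFatalException extends RuntimeException {
        PerJobFatalException(String message) {
            super(message);
        }
    }

    /** Walks the cause chain and returns the first throwable of the given type, if any. */
    static <T extends Throwable> Optional<T> findThrowable(Throwable error, Class<T> type) {
        for (Throwable current = error; current != null; current = current.getCause()) {
            if (type.isInstance(current)) {
                return Optional.of(type.cast(current));
            }
        }
        return Optional.empty();
    }

    /** Picks the startup-failure code when the marker exception is present, else the runtime code. */
    static int chooseExitCode(Throwable error) {
        return findThrowable(error, PerJobFatalException.class).isPresent()
                ? STARTUP_FAILURE_RETURN_CODE
                : RUNTIME_FAILURE_RETURN_CODE;
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(
                "restore failed", new PerJobFatalException("wrong checkpoint path"));
        System.out.println(chooseExitCode(wrapped));                              // prints 1
        System.out.println(chooseExitCode(new IllegalStateException("other")));   // prints 2
    }
}
```

Returning the code from a pure function, rather than calling System.exit inside the branch as the patch does, keeps the decision testable without terminating the JVM.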



--
This message was sent by Atlassian Jira
(v8.3.4#803005)