Posted to issues@tez.apache.org by "Mudit Sharma (Jira)" <ji...@apache.org> on 2023/02/13 04:13:00 UTC

[jira] [Comment Edited] (TEZ-4474) DAG recovery failure leads to AM status SUCCEEDED

    [ https://issues.apache.org/jira/browse/TEZ-4474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17687707#comment-17687707 ] 

Mudit Sharma edited comment on TEZ-4474 at 2/13/23 4:12 AM:
------------------------------------------------------------

[~srahman], thanks for reviewing the patch. I have raised a PR: https://github.com/apache/tez/pull/266

Also, on why the Tez session is set to IDLE:

This happens because of this if condition: [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1987]

In the second attempt, isSession is true, so execution enters the else branch of this if-else and sets the state to IDLE. I have seen the log statement from this else branch in my AM output: In Session mode. Waiting for DAG over RPC

We are using version 0.9.2, but this condition appears to be unchanged on master.
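
To make the branch above concrete, here is a minimal Java sketch of the behaviour as I understand it. It is a simplified model, not the actual DAGAppMaster code: the class SessionStartSketch, the enum AmState, the method stateAfterStart and the recoveredDag parameter are hypothetical names, and the assumption that every other path ends in RUNNING is mine.

```java
// Hypothetical, simplified model of the AM start-up branch discussed above.
// It is NOT the real DAGAppMaster code; all names here are illustrative.
final class SessionStartSketch {

  // Reduced stand-in for the AM state enum; only the values needed here.
  enum AmState { IDLE, RUNNING }

  /**
   * @param isSession    whether the AM runs in session mode
   * @param recoveredDag whether recovery produced a DAG to resume (assumed input)
   */
  static AmState stateAfterStart(boolean isSession, boolean recoveredDag) {
    if (isSession && !recoveredDag) {
      // Corresponds to the else branch that logs
      // "In Session mode. Waiting for DAG over RPC" and sets the state to IDLE.
      return AmState.IDLE;
    }
    // Assumption: in every other case a DAG is started or resumed.
    return AmState.RUNNING;
  }

  public static void main(String[] args) {
    // Second attempt, session mode, recovery produced nothing to resume.
    System.out.println(stateAfterStart(true, false));
  }
}
```

The point of the sketch is only that a failed recovery in session mode looks the same to the AM as a fresh session with no DAG submitted yet.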


was (Author: mudit-97):
[~srahman], thanks for reviewing the patch. I was not able to raise a PR, which is why I attached a patch.

> DAG recovery failure leads to AM status SUCCEEDED
> -------------------------------------------------
>
>                 Key: TEZ-4474
>                 URL: https://issues.apache.org/jira/browse/TEZ-4474
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.9.2, 0.10.0, 0.10.1, 0.10.2
>            Reporter: Mudit Sharma
>            Priority: Critical
>         Attachments: 0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Summary of the Issue:
> When Tez DAG recovery fails for some reason on the second attempt of a Tez AM, then in a corner case the Tez job sets the AM state (DAGAppMasterState) to IDLE.
> Once the state is IDLE, then after checkAndHandleSessionTimeout() the Tez AM tries to shut down, and since recovery failed there is no running DAG.
> If there is no running DAG and the state is IDLE, the AM sets its status to SUCCEEDED by default, because of this if-else:
> [https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L1266]
> {code}
> public void shutdownTezAM(String dagKillmessage) throws TezException {
>   if (!sessionStopped.compareAndSet(false, true)) {
>     // No need to shutdown twice.
>     // Return with a no-op if shutdownTezAM has been invoked earlier.
>     return;
>   }
>   synchronized (this) {
>     this.taskSchedulerManager.setShouldUnregisterFlag();
>     if (currentDAG != null
>         && !currentDAG.isComplete()) {
>       //send a DAG_TERMINATE message
>       LOG.info("Sending a kill event to the current DAG"
>           + ", dagId=" + currentDAG.getID());
>       tryKillDAG(currentDAG, dagKillmessage);
>     } else {
>       LOG.info("No current running DAG, shutting down the AM");
>       if (isSession && !state.equals(DAGAppMasterState.ERROR)) {
>         state = DAGAppMasterState.SUCCEEDED;
>       }
>       shutdownHandler.shutdown();
>     }
>   }
> }
> {code}
>  
> This can cause problems in dependent systems such as Hive, which proceed with the next tasks in the pipeline on the assumption that the DAG succeeded. In Hive this can result in empty data being moved.
> As part of this JIRA, we propose a patch for Tez that introduces a config. When it is set and shutdown happens with no running DAG, the Tez status is always marked FAILED instead of SUCCEEDED, provided the state at that time was not ERROR.
>  
> This is the patch; please review and let us know your thoughts: [^0001-TEZ-4474-Added-config-to-fail-the-DAG-status-when-sh.patch]
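
The proposal in the description can be sketched roughly as follows. This is a minimal, self-contained Java model of the status decision in the else branch of shutdownTezAM, not the attached patch itself: the class ShutdownStatusSketch, the reduced AmState enum, the method finalState and the flag failOnNoRunningDag are hypothetical names, and the real config key is not given in this message.

```java
// Hypothetical model of the status decision when shutdown finds no running DAG.
// It mirrors the quoted else branch and adds the proposed opt-in flag.
final class ShutdownStatusSketch {

  // Reduced stand-in for DAGAppMasterState; only the values needed here.
  enum AmState { IDLE, ERROR, SUCCEEDED, FAILED }

  /**
   * @param isSession          whether the AM runs in session mode
   * @param current            AM state at shutdown time
   * @param failOnNoRunningDag stand-in for the proposed config (name is an assumption)
   */
  static AmState finalState(boolean isSession, AmState current, boolean failOnNoRunningDag) {
    if (isSession && current != AmState.ERROR) {
      // Existing behaviour marks SUCCEEDED; the proposal marks FAILED when the flag is set.
      return failOnNoRunningDag ? AmState.FAILED : AmState.SUCCEEDED;
    }
    // ERROR (or non-session mode) is left untouched, as in the quoted code.
    return current;
  }

  public static void main(String[] args) {
    System.out.println(finalState(true, AmState.IDLE, false)); // current behaviour
    System.out.println(finalState(true, AmState.IDLE, true));  // with the proposed flag
  }
}
```

Keeping the flag off by default would preserve the current behaviour for existing users, while callers such as Hive could opt in to see FAILED after a failed recovery.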
