Posted to dev@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2022/01/03 07:41:00 UTC

[jira] [Resolved] (TEZ-4349) DAGClient gets stuck with invalid cached DAGStatus

     [ https://issues.apache.org/jira/browse/TEZ-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor resolved TEZ-4349.
-------------------------------
    Resolution: Fixed

> DAGClient gets stuck with invalid cached DAGStatus
> --------------------------------------------------
>
>                 Key: TEZ-4349
>                 URL: https://issues.apache.org/jira/browse/TEZ-4349
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Ahmed Hussein
>            Assignee: Ahmed Hussein
>            Priority: Major
>             Fix For: 0.10.2
>
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I found that some Oozie launchers get stuck waiting for the job to complete.
> After investigation I found that {{dagClient.getDAGStatus(null)}} calls the overload {{dagClient.getDAGStatus(null, 0)}}, which then calls {{getDAGStatusInternal}}, which relies on the {{cachedDagStatus}} field.
> When the AM can no longer be reached, the {{cachedDagStatus}} is never updated, so the same stale status is returned on every call and the launcher waits indefinitely.
>  [https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java#L212]
> {code:java}
>       if (!dagCompleted) {
>         if (dagStatus != null) {
>           cachedDagStatus = dagStatus;
>           return dagStatus;
>         }
>         if (cachedDagStatus != null) {
>           // could not get from AM (not reachable/ was killed). return cached status.
>           return cachedDagStatus;
>         }
>       }
> {code}
> +To Fix:+
> The {{cachedDagStatus}} should be valid only for a certain amount of time, or a certain number of retries.
> When the {{cachedDagStatus}} expires, the DAGClient tries to pull the status from the AM or the RM.
> If fetching the status from both the AM and the RM fails, null is returned to the caller.
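
The proposal above could be sketched roughly as follows. This is only an illustration of a time-bounded cache, not the patch that was committed for TEZ-4349: the class name ExpiringStatusCache, its get method, the injectable clock, and the two suppliers standing in for the AM and RM lookups are all hypothetical and do not exist in Tez.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Tez): remembers the last status fetched
 * from a primary source and serves it only while it is younger than a
 * configured maximum age.
 */
final class ExpiringStatusCache<T> {
  private final long maxAgeMillis;
  private final LongSupplier clock; // injectable so expiry can be tested
  private T cached;
  private long cachedAtMillis;

  ExpiringStatusCache(long maxAgeMillis, LongSupplier clock) {
    this.maxAgeMillis = maxAgeMillis;
    this.clock = clock;
  }

  /**
   * Tries the primary source (standing in for the AM). On failure, returns
   * the cached value if it has not expired. Once the cache has expired,
   * tries the fallback source (standing in for the RM) and returns its
   * result, which is null when the fallback fails as well.
   */
  T get(Supplier<T> primary, Supplier<T> fallback) {
    long now = clock.getAsLong();
    T fresh = fetchQuietly(primary);
    if (fresh != null) {
      cached = fresh;
      cachedAtMillis = now;
      return fresh;
    }
    if (cached != null && now - cachedAtMillis <= maxAgeMillis) {
      return cached; // primary unreachable, cached status still valid
    }
    cached = null; // expired: stop serving the stale status
    return fetchQuietly(fallback);
  }

  /** Treats a runtime exception from a source the same as "no status". */
  private static <T> T fetchQuietly(Supplier<T> source) {
    try {
      return source.get();
    } catch (RuntimeException e) {
      return null;
    }
  }
}
```

With a real clock this might be constructed as new ExpiringStatusCache<>(60_000L, System::currentTimeMillis), where the 60 second limit is an arbitrary example value. A caller that polls in a loop then sees null once both sources fail after expiry, instead of the same stale status forever. A retry-count limit, the other option mentioned in the description, would replace the age check with a counter.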



--
This message was sent by Atlassian Jira
(v8.20.1#820001)