You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by GitBox <gi...@apache.org> on 2021/11/15 17:52:20 UTC

[GitHub] [tez] amahussein commented on a change in pull request #161: TEZ-4349. DAGClient gets stuck with invalid cached DAGStatus

amahussein commented on a change in pull request #161:
URL: https://github.com/apache/tez/pull/161#discussion_r749556005



##########
File path: tez-api/src/main/java/org/apache/tez/dag/api/client/DAGClientImpl.java
##########
@@ -221,13 +245,14 @@ protected DAGStatus getDAGStatusInternal(@Nullable Set<StatusGetOpts> statusOpti
       final DAGStatus dagStatus = getDAGStatusViaAM(statusOptions, timeout);
 
       if (!dagCompleted) {
-        if (dagStatus != null) {
-          cachedDagStatus = dagStatus;
+        if (dagStatus != null) { // update the cached DAGStatus
+          cachedDAGStatusRef.setValue(dagStatus);
           return dagStatus;
         }
-        if (cachedDagStatus != null) {
+        DAGStatus cachedDAG = cachedDAGStatusRef.getValue();
+        if (cachedDAG != null) {
           // could not get from AM (not reachable/ was killed). return cached status.
-          return cachedDagStatus;
+          return cachedDAG;

Review comment:
       Thanks for the feedback.
   
   Yes, considering the implementation of [TezJob.run()-Line222](https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezJob.java#L222) in Pig:
   
   Pig polls on the DAGStatus inside an infinite loop:
   ```java
           while (true) {
               try {
                   dagStatus = dagClient.getDAGStatus(null);
               } catch (Exception e) {
                   log.info("Cannot retrieve DAG status", e);
                   break;
               }
              if (dagStatus.isCompleted()) {
                  // do something
                  // break;
              }
              sleep(1000);
          }  
   ```
   
   
   Let's assume the following scenario on Tez Side:
   - Pig first iteration calls `getDAGStatusViaAM()` which successfully pulls the DAGStatus and updates the cachedDAGStatus to running.
   - Pig sleeps 1000
   - second call from Pig calls `getDAGStatusViaAM()` which encounters `TezException` or `IOException`. The call would return the last cachedDAGStatus (which is running), instead of null.
   - Since the status is `running`, the Pig-thread sleeps
   - This will keep going as long as the `getDAGStatusViaAM()` fails, and the last valid DAGStatus is still cached.
   
   The problem in this corner case is that the Pig client will keep looping indefinitely as long as it does not receive a null or dagClient.getDAGStatus(null) does not throw an exception.
   From a client perspective, it is better to fail early in order to recover faster.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@tez.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org