You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@tez.apache.org by "Kuhu Shukla (JIRA)" <ji...@apache.org> on 2016/11/23 20:16:58 UTC

[jira] [Created] (TEZ-3549) TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly

Kuhu Shukla created TEZ-3549:
--------------------------------

             Summary: TaskAttemptImpl does not initialize TEZ_TASK_PROGRESS_STUCK_INTERVAL_MS correctly
                 Key: TEZ-3549
                 URL: https://issues.apache.org/jira/browse/TEZ-3549
             Project: Apache Tez
          Issue Type: Bug
    Affects Versions: 0.7.1
            Reporter: Kuhu Shukla
            Assignee: Kuhu Shukla


{code}
hadoop jar /home/gs/tez/current/tez-examples-x.x.x.jar  orderedwordcount  -Dtez.task.progress.stuck.interval-ms=700000 /input  /output
{code}
With {{tez.task.am.heartbeat.interval-ms.max=3000}} and {{timeout-ms=300000}}
fails as follows:
{code}
 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
 INFO examples.OrderedWordCount: DAG diagnostics: [Vertex failed, vertexName=Tokenizer, vertexId=vertex_123_456_1_00, diagnostics=[Task failed, taskId=task_123_456_1_00_000007, diagnostics=[TaskAttempt 0 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt 1 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt 2 failed, info=[Attempt failed because it appears to make no progress for 700000ms], TaskAttempt 3 failed, info=[Attempt failed because it appears to make no progress for 700000ms]], Vertex did not succeed due to OWN_TASK_FAILURE, failedTasks:1 killedTasks:51, Vertex vertex_123_456_1_00 [Tokenizer] killed/failed due to:OWN_TASK_FAILURE], Vertex killed, vertexName=Sorter, vertexId=vertex_123_456_1_02, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_02 [Sorter] killed/failed due to:OTHER_VERTEX_FAILURE], Vertex killed, vertexName=Summation, vertexId=vertex_123_456_1_01, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to OTHER_VERTEX_FAILURE, failedTasks:0 killedTasks:1, Vertex vertex_123_456_1_01 [Summation] killed/failed due to:OTHER_VERTEX_FAILURE], DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:2]
{code}

This is because in TaskAttemptImpl, {{lastNotifyProgressTimestamp}} is 0 until progress gets notified and the following code always evaluates to true if this config is set/non-zero.
TaskUpdaterTransition:
{code}
      if (statusEvent.getProgressNotified()) {
        ta.lastNotifyProgressTimestamp = ta.clock.getTime();
      } else {
        long currTime = ta.clock.getTime();
        if (ta.hungIntervalMax > 0 &&
            currTime - ta.lastNotifyProgressTimestamp > ta.hungIntervalMax) {
{code}

Ideally lastNotifyProgressTimestamp should be a timestamp from the onset so that this comparison is valid.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)