Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2015/06/02 01:38:17 UTC

[jira] [Created] (TEZ-2509) YarnTaskSchedulerService should not try to allocate containers if AM is shutting down

Hitesh Shah created TEZ-2509:
--------------------------------

             Summary: YarnTaskSchedulerService should not try to allocate containers if AM is shutting down
                 Key: TEZ-2509
                 URL: https://issues.apache.org/jira/browse/TEZ-2509
             Project: Apache Tez
          Issue Type: Bug
            Reporter: Hitesh Shah
            Assignee: Hitesh Shah


Observed during recovery testing:

During DAG shutdown, all 4 attempts of the same task failed, so the task itself was marked as FAILED.

{code}
2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
TaskAttempt 1 failed, info=[AttemptId: attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state]
TaskAttempt 2 failed, info=[Container container_e02_1433141118424_0012_01_000012 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
TaskAttempt 3 failed, info=[Container container_e02_1433141118424_0012_01_000025 hit an invalid transition - C_NM_STOP_SENT at RUNNING], counters=Counters: 0
{code}
  

Kill signal received by the DAGAppMaster, which then signals the TaskScheduler:
{code}
2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
{code}

The first attempt is marked as failed because its container was being stopped:
{code}
2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=initialmap, taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING, counters=Counters: 0
{code}

The next attempt is still scheduled and assigned to a reused container after the kill signal, and it eventually fails:
{code}
2015-06-01 07:38:26,919 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Assigning container to task, container=Container: [ContainerId: container_e02_1433141118424_0012_01_000011, NodeId: ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, containerResourceMemory=1536, containerResourceVCores=1
{code}

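The scheduler stops too late. It only enters the STOPPED state at 07:38:27,403, about 1.6 seconds after the kill signal and after the task had already been marked FAILED: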
{code}
2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED
{code}
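
A possible direction, sketched only to illustrate the idea in the summary and not taken from the Tez code base: record that shutdown has started as soon as the signal arrives, and check that flag before assigning any container. The class ShutdownAwareScheduler and its methods initiateStop() and tryAssign() are hypothetical names invented for this sketch; the real YarnTaskSchedulerService has a different structure.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a task scheduler. This is NOT the actual
// YarnTaskSchedulerService; all names here are assumptions for illustration.
final class ShutdownAwareScheduler {
    // Set as soon as the shutdown signal is seen, before the service is fully stopped.
    private final AtomicBoolean shuttingDown = new AtomicBoolean(false);
    private int assignedCount = 0;

    // Called from the shutdown path (for example, a shutdown hook).
    void initiateStop() {
        shuttingDown.set(true);
    }

    // Returns true only if the attempt was assigned to the container.
    // Once shutdown has started, no new assignment is made.
    synchronized boolean tryAssign(String attemptId, String containerId) {
        if (shuttingDown.get()) {
            return false;
        }
        assignedCount++;
        return true;
    }

    synchronized int getAssignedCount() {
        return assignedCount;
    }

    public static void main(String[] args) {
        ShutdownAwareScheduler scheduler = new ShutdownAwareScheduler();
        System.out.println(scheduler.tryAssign("attempt_0", "container_a"));
        scheduler.initiateStop();
        System.out.println(scheduler.tryAssign("attempt_1", "container_b"));
    }
}
```

With a guard like this, a retry such as attempt_1434141118424 style follow-up attempts would not be handed a container that is already being stopped; whether pending attempts should instead be failed fast or simply dropped during shutdown is left open here.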





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)