You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/06/02 04:15:17 UTC

[jira] [Issue Comment Deleted] (TEZ-2509) YarnTaskSchedulerService should not try to allocate containers if AM is shutting down

     [ https://issues.apache.org/jira/browse/TEZ-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated TEZ-2509:
----------------------------
    Comment: was deleted

(was: Should also stop assign containers in DelayedContainerManager ?

)

> YarnTaskSchedulerService should not try to allocate containers if AM is shutting down
> -------------------------------------------------------------------------------------
>
>                 Key: TEZ-2509
>                 URL: https://issues.apache.org/jira/browse/TEZ-2509
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Hitesh Shah
>         Attachments: TEZ-2509.1.patch, TEZ-2509.2.patch
>
>
> Observed when doing some recovery testing: 
> Failure as during dag shutdown, 4 attempts of the same task failed. 
> {code}
> 2015-06-01 07:38:27,184 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_FINISHED]: vertexName=initialmap, taskId=task_1433141118424_0012_2_00_000003, startTime=1433144297281, finishTime=1433144307184, timeTaken=9903, status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, info=[Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
> TaskAttempt 1 failed, info=[AttemptId: attempt_1433141118424_0012_2_00_000003_1 cannot be allocated to container: container_e02_1433141118424_0012_01_000011 in STOP_REQUESTED state]
> TaskAttempt 2 failed, info=[Container container_e02_1433141118424_0012_01_000012 hit an invalid transition - C_NM_STOP_SENT at RUNNING]
> TaskAttempt 3 failed, info=[Container container_e02_1433141118424_0012_01_000025 hit an invalid transition - C_NM_STOP_SENT at RUNNING], counters=Counters: 0
> {code}
>   
> DAG kill signal received.
> {code}
> 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMasterShutdownHook invoked
> 2015-06-01 07:38:25,811 INFO [Thread-3] app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
> {code}
> First attempt marked as failed as container was killed.
> {code}
> 2015-06-01 07:38:26,906 INFO [Dispatcher thread: Central] history.HistoryEventHandler: [HISTORY][DAG:dag_1433141118424_0012_2][Event:TASK_ATTEMPT_FINISHED]: vertexName=initialmap, taskAttemptId=attempt_1433141118424_0012_2_00_000003_0, startTime=1433144297281, finishTime=1433144306904, timeTaken=9623, status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Container container_e02_1433141118424_0012_01_000018 hit an invalid transition - C_NM_STOP_SENT at RUNNING, counters=Counters: 0
> {code}
> Subsequent attempt scheduled, assigned and eventually fails. 
> {code}
> 2015-06-01 07:38:26,919 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: Assigning container to task, container=Container: [ContainerId: container_e02_1433141118424_0012_01_000011, NodeId: ip-172-31-18-41.ec2.internal:45454, NodeHttpAddress: ip-172-31-18-41.ec2.internal:8042, Resource: <memory:1536, vCores:1>, Priority: 2, Token: Token { kind: ContainerToken, service: 172.31.18.41:45454 }, ], task=attempt_1433141118424_0012_2_00_000003_1, containerHost=ip-172-31, localityMatchType=NodeLocal, matchedLocation=ip-172-31-18-41.ec2.internal, honorLocalityFlags=true, reusedContainer=true, delayedContainers=4, containerResourceMemory=1536, containerResourceVCores=1
> {code}
> Scheduler stops too late.
> {code}
> 2015-06-01 07:38:27,403 DEBUG [Thread-3] service.AbstractService: Service: org.apache.tez.dag.app.rm.YarnTaskSchedulerService entered state STOPPED
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)