Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2013/09/26 00:11:05 UTC

[jira] [Commented] (TEZ-499) DAG does not shutdown cleanly on error/kill from client - AMRMHeartbeat thread does not stop.

    [ https://issues.apache.org/jira/browse/TEZ-499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13778154#comment-13778154 ] 

Bikas Saha commented on TEZ-499:
--------------------------------

This behavior is correct with respect to AMRMClient. The problem is that the test asserts on the app state, the assertion fails because the state does not match, and the test then shuts down the minicluster. If the minicluster RM dies before the AM unregisters, the AM will hang around for the YARN retry period. This probably shouldn't happen on Linux if the group ID is used to launch and kill containers.
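
One way a test could avoid this ordering problem is to wait, with a bound, until the AM has unregistered before stopping the minicluster. The following is only a minimal sketch of such a bounded wait, using nothing but the JDK. The class name ShutdownOrdering and the method waitFor are hypothetical and are not part of Tez or YARN; the condition passed in (for example, "the RM reports the application in a terminal state") would have to be supplied by the test and is not shown here.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of Tez or YARN.
public class ShutdownOrdering {

    /**
     * Polls the condition until it returns true or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interrupt.
     */
    static boolean waitFor(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison correct if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller and give up.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

In a test, the idea would be to call waitFor with the app-finished check first, and only then run the state assertion and stop the minicluster, ideally stopping it in a finally block so a failed assertion does not change the order.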
                
> DAG does not shutdown cleanly on error/kill from client - AMRMHeartbeat thread does not stop.
> ---------------------------------------------------------------------------------------------
>
>                 Key: TEZ-499
>                 URL: https://issues.apache.org/jira/browse/TEZ-499
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>         Attachments: TestMRRJobsDAGApi.testMRRSleepJobDagSubmitAndKill.AM.syslog
>
>
> DAG killed due to user-initiated kill. failedVertices:0 killedVertices:3
> Invalid event V_INTERNAL_ERROR on Vertex vertex_1380133982871_0001_1_01, counters=Counters: 1, org.apache.tez.common.counters.DAGCounter, NUM_KILLED_TASKS=1
> 2013-09-25 11:33:15,547 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.DAGAppMaster: On DAG completion. Old state: KILLED new state: KILLED
> 2013-09-25 11:33:15,547 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.DAGAppMaster: Shutting down on completion of dag:dag_1380133982871_0001_1
> 2013-09-25 11:33:15,547 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.DAGAppMaster: Ignoring multiple shutdown events
> 2013-09-25 11:33:15,995 INFO [IPC Server handler 0 on 52821] org.apache.tez.dag.app.DAGAppMaster: Sending client kill to dag: dag_1380133982871_1_000001
> 2013-09-25 11:33:16,479 INFO [AMRM Callback Handler Thread] org.apache.hadoop.yarn.util.RackResolver: Resolved 10.11.2.158 to /default-rack
> 2013-09-25 11:33:16,480 INFO [AMRM Callback Handler Thread] org.apache.tez.dag.app.rm.TaskScheduler: Releasing container, No RM requests matching container: Container: [ContainerId: container_1380133982871_0001_01_000002, NodeId: 10.11.2.158:52797, NodeHttpAddress: 10.11.2.158:52800, Resource: <memory:1024, vCores:1>, Priority: 3, Token: Token { kind: ContainerToken, service: 10.11.2.158:52797 }, ]
> 2013-09-25 11:33:16,480 INFO [AMRM Callback Handler Thread] org.apache.tez.dag.app.rm.TaskScheduler: Allocated resource memory: 0 cpu:0
> 2013-09-25 11:33:20,546 INFO [AMShutdownThread] org.apache.tez.dag.app.DAGAppMaster: Calling stop for all the services
> 2013-09-25 11:33:20,548 INFO [AMShutdownThread] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
> 2013-09-25 11:33:20,551 INFO [Thread-54] org.apache.tez.dag.app.rm.TaskScheduler: AllocatedContainerManager Thread interrupted
> 2013-09-25 11:33:48,483 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:49,484 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:50,485 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:51,485 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:52,486 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:53,486 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:54,487 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
> 2013-09-25 11:33:55,487 INFO [AMRM Heartbeater thread] org.apache.hadoop.ipc.Client: Retrying connect to server: 10.11.2.158/10.11.2.158:52794. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
