Posted to user@tez.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2014/11/17 15:45:53 UTC

AM shutdown while executing OutputCommitter

Dear Tez community,

a customer is running into the following problem with a Tez 0.5 job.

On the client side it looks as follows:

- progress reporting shows that all tasks are complete:
	(DagRunner.java:148) - Progress on 'Import job (26234): Raw History ImportJob#import(Group by operation)#import(Group by operation) (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING: 
- then there seems to be connectivity problems to the Application Master (AM):
	INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is 
	…
	INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- then the job appears to be failed:
	INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG 'XXX' (application=application_1414525508751_5385) with status=FAILED


On the AM side it looks as follows:
- all tasks are completed:
	2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex: vertex_1414525508751_5385_1_01
- our implementation of org.apache.tez.runtime.api.OutputCommitter starts its work
- the AM suddenly gets shut down:
	2014-11-14 14:31:42,169 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_1414525508751_5385_01_000056, containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0, heldContainers=1, delayedContainers=0, isNew=false
	2014-11-14 14:31:42,170 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container: container_1414525508751_5385_01_000056
	2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook invoked
	2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
	2014-11-14 14:32:01,848 INFO [Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
	2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
	2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping RecoveryService
	2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
	2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread] org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take interrupted. Returning
	2014-11-14 14:32:01,897 INFO [Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output Stream for DAG dag_1414525508751_5385_1
	2014-11-14 14:32:01,905 INFO [Thread-1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
	2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue take interrupted. Returning
	2014-11-14 14:32:01,906 INFO [Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop request to the NM for ContainerId: container_1414525508751_5385_01_000056
	2014-11-14 14:32:01,913 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
	2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are shutting down, so ignoring this

org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
Caused by: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
	at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
	at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
	at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
	at java.util.concurrent.FutureTask.get(FutureTask.java:187)
	at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)

Any idea what the problem could be?
Is there a timeout that the OutputCommitter hits, after which the AM gets killed by the ResourceManager?
Does the client kill the application via the ResourceManager when it can't reach the ApplicationMaster anymore?
How big is the chance that the AM is unresponsive while it's executing the OutputCommitter?

Any help appreciated!
Johannes

Re: AM shutdown while executing OutputCommitter

Posted by Siddharth Seth <ss...@apache.org>.
The signal to the AM could have come from the NodeManager if the AM
exceeds its memory limits. Do you have YARN memory enforcement enabled on
this cluster? You could look at the NM logs to see if this was the case.
Otherwise, check the RM logs to figure out when and whether a KILL request
for the app was sent.
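The NM-side check boils down to grepping for the memory-enforcement kill message. A minimal sketch, using sample log lines as a stand-in for the real NodeManager log (the log path, logger name, and exact wording are assumptions here; they vary by Hadoop version and distribution):

```shell
APP=1414525508751_5385

# Stand-in for the real NodeManager log file; the first line is modeled on
# what the ContainersMonitor logs when pmem enforcement kills a container.
cat > /tmp/nm_sample.log <<'EOF'
2014-11-14 14:31:55,123 WARN ContainersMonitorImpl: Container [pid=1234,containerID=container_1414525508751_5385_01_000001] is running beyond physical memory limits. Current usage: 2.1 GB of 2 GB physical memory used. Killing container.
2014-11-14 14:31:56,001 INFO ContainerLaunch: Container container_1414525508751_5385_01_000002 transitioned to DONE
EOF

# Was one of this app's containers killed for exceeding its memory limit?
grep "running beyond" /tmp/nm_sample.log | grep "$APP"
```

On a real cluster the same grep would run against the NodeManager log on the AM's host, and a similar one against the ResourceManager log (the application id together with "Kill") to see when and whether a kill request came in.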

On Mon, Nov 17, 2014 at 6:45 AM, Johannes Zillmann <jzillmann@googlemail.com> wrote:

> Dear Tez community,
>
> a customer is running into following problem with a Tez-0.5 job.
> …