You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@tez.apache.org by Johannes Zillmann <jz...@googlemail.com> on 2014/11/17 15:45:53 UTC
AM shutdown while executing OutputCommitter
Dear Tez community,
a customer is running into following problem with a Tez-0.5 job.
On the client side it looks as follow:
- progress reporting shows that all tasks are complete:
(DagRunner.java:148) - Progress on 'Import job (26234): Raw History ImportJob#import(Group by operation)#import(Group by operation) (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING:
- then there seems to be connectivity problems to the Application Master (AM):
INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 0 time(s); retry policy is
…
INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying connect to server: master/172.30.31.27:34275. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- then the job appears to be failed:
INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG ‘XXX' (application=application_1414525508751_5385) with status=FAILED
On the AM side it looks as follow:
- all tasks are completed:
2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for vertex: vertex_1414525508751_5385_1_01
- our implementation of org.apache.tez.runtime.api.OutputCommitter starts its work
- the AM suddenly gets shut down:
2014-11-14 14:31:42,169 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests. Container's idle timeout delay expired or is new. Releasing container, containerId=container_1414525508751_5385_01_000056, containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0, heldContainers=1, delayedContainers=0, isNew=false
2014-11-14 14:31:42,170 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused container: container_1414525508751_5385_01_000056
2014-11-14 14:32:01,848 INFO [
Thread
-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook invoked
2014-11-14 14:32:01,848 INFO [
Thread
-1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
2014-11-14 14:32:01,848 INFO [
Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler: TaskScheduler notified that iSignalled was : true
2014-11-14 14:32:01,849 INFO [Thread-1] org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping RecoveryService
2014-11-14 14:32:01,849 INFO [
Thread
-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary Stream
2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread] org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take interrupted. Returning
2014-11-14 14:32:01,897 INFO [
Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Output Stream for
DAG dag_1414525508751_5385_1
2014-11-14 14:32:01,905 INFO [
Thread
-1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService: EventQueue take interrupted. Returning
2014-11-14 14:32:01,906 INFO [
Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a stop request to the NM for
ContainerId: container_1414525508751_5385_01_000056
2014-11-14 14:32:01,913 INFO [DelayedContainerManager] org.apache.tez.dag.app.rm.YarnTaskSchedulerService: AllocatedContainerManager
Thread
interrupted
2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler
Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are shutting down, so ignoring this
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
Caused by: org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
at org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
Caused by: java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)
Any idea on what the problem could be ?
Is there a timeout for the OutputCommitter hits and the AM is killed by resource-manager ?
Does the client kills the application via resource-manager when it can’t reach the application master anymore ?
How big the chance is that the AM i unresponsive while its executing the OutputCommitter ?
Any help appreciated!
Johannes
Re: AM shutdown while executing OutputCommitter
Posted by Siddharth Seth <ss...@apache.org>.
The signal to the AM could have come in from the NodeManager if the AM
exceeds it's memory limits. Do you have YARN memory enforcement enabled on
this cluster ? You could look up the NM logs to see if this were the case.
Otherwise the RM logs to figure out when and if a KILL request for the app
was sent.
On Mon, Nov 17, 2014 at 6:45 AM, Johannes Zillmann <jzillmann@googlemail.com
> wrote:
> Dear Tez community,
>
> a customer is running into following problem with a Tez-0.5 job.
>
> On the client side it looks as follow:
>
> - progress reporting shows that all tasks are complete:
> (DagRunner.java:148) - Progress on 'Import job (26234): Raw
> History ImportJob#import(Group by operation)#import(Group by operation)
> (b22473b9-5053-40bc-9df6-02b8b38eb6d8)' in state RUNNING:
> - then there seems to be connectivity problems to the Application Master
> (AM):
> INFO [2014-11-14 14:32:03.635] (Client.java:837) - Retrying
> connect to server: master/172.30.31.27:34275. Already tried 0 time(s);
> retry policy is
> …
> INFO [2014-11-14 14:32:12.644] (Client.java:837) - Retrying
> connect to server: master/172.30.31.27:34275. Already tried 9 time(s);
> retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
> sleepTime=1000 MILLISECONDS)
> - then the job appears to be failed:
> INFO [2014-11-14 14:32:13.852] [MrPlanRunnerV2] - Finished DAG
> ‘XXX' (application=application_1414525508751_5385) with status=FAILED
>
>
> On the AM side it looks as follow:
> - all tasks are completed:
> 2014-11-14 14:31:36,442 INFO [AsyncDispatcher event handler]
> org.apache.tez.dag.app.dag.impl.DAGImpl: Committing output: output for
> vertex: vertex_1414525508751_5385_1_01
> - our implementation of org.apache.tez.runtime.api.OutputCommitter starts
> its work
> - the AM suddenly gets shut down:
> 2014-11-14 14:31:42,169 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService: No taskRequests.
> Container's idle timeout delay expired or is new. Releasing container,
> containerId=container_1414525508751_5385_01_000056,
> containerExpiryTime=1416000701962, idleTimeout=5000, taskRequestsCount=0,
> heldContainers=1, delayedContainers=0, isNew=false
> 2014-11-14 14:31:42,170 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService: Releasing unused
> container: container_1414525508751_5385_01_000056
> 2014-11-14 14:32:01,848 INFO [
> Thread
> -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMasterShutdownHook
> invoked
> 2014-11-14 14:32:01,848 INFO [
> Thread
> -1] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a
> signal. Signaling TaskScheduler
> 2014-11-14 14:32:01,848 INFO [
> Thread-1] org.apache.tez.dag.app.rm.TaskSchedulerEventHandler:
> TaskScheduler notified that iSignalled was : true
> 2014-11-14 14:32:01,849 INFO [Thread-1]
> org.apache.tez.dag.history.HistoryEventHandler: Stopping HistoryEventHandler
> 2014-11-14 14:32:01,849 INFO [
> Thread
> -1] org.apache.tez.dag.history.recovery.RecoveryService: Stopping
> RecoveryService
> 2014-11-14 14:32:01,849 INFO [
> Thread
> -1] org.apache.tez.dag.history.recovery.RecoveryService: Closing Summary
> Stream
> 2014-11-14 14:32:01,849 INFO [RecoveryEventHandlingThread]
> org.apache.tez.dag.history.recovery.RecoveryService: EventQueue take
> interrupted. Returning
> 2014-11-14 14:32:01,897 INFO [
> Thread-1] org.apache.tez.dag.history.recovery.RecoveryService: Closing
> Output Stream for
> DAG dag_1414525508751_5385_1
> 2014-11-14 14:32:01,905 INFO [
> Thread
> -1] org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService:
> Stopping SimpleHistoryLoggingService, eventQueueBacklog=0
> 2014-11-14 14:32:01,905 INFO [HistoryEventHandlingThread]
> org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService:
> EventQueue take interrupted. Returning
> 2014-11-14 14:32:01,906 INFO [
> Thread-1] org.apache.tez.dag.app.launcher.ContainerLauncherImpl: Sending a
> stop request to the NM for
> ContainerId: container_1414525508751_5385_01_000056
> 2014-11-14 14:32:01,913 INFO [DelayedContainerManager]
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService:
> AllocatedContainerManager
> Thread
> interrupted
> 2014-11-14 14:32:01,915 ERROR [AMRM Callback Handler
> Thread] org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread
> Thread[AMRM Callback Handler Thread,5,main] threw an Throwable, but we are
> shutting down, so ignoring this
>
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
> org.apache.tez.dag.api.TezUncheckedException: java.lang.InterruptedException
> at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:313)
> Caused by: org.apache.tez.dag.api.TezUncheckedException:
> java.lang.InterruptedException
> at
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:106)
> at
> org.apache.tez.dag.app.rm.YarnTaskSchedulerService.getProgress(YarnTaskSchedulerService.java:852)
> at
> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:309)
> Caused by: java.lang.InterruptedException
> at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
> at java.util.concurrent.FutureTask.get(FutureTask.java:187)
> at
> org.apache.tez.dag.app.rm.TaskSchedulerAppCallbackWrapper.getProgress(TaskSchedulerAppCallbackWrapper.java:104)
>
> Any idea on what the problem could be ?
> Is there a timeout for the OutputCommitter hits and the AM is killed by
> resource-manager ?
> Does the client kills the application via resource-manager when it can’t
> reach the application master anymore ?
> How big the chance is that the AM i unresponsive while its executing the
> OutputCommitter ?
>
> Any help appreciated!
> Johannes