Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2013/12/05 02:11:35 UTC
[jira] [Commented] (TEZ-660) 1TB teragen job failed due to
'TezUncheckedException: event.terminationCause is unexpected:
INTERNAL_ERROR'
[ https://issues.apache.org/jira/browse/TEZ-660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839637#comment-13839637 ]
Siddharth Seth commented on TEZ-660:
------------------------------------
The NodeFailure caused successful attempts to be killed, irrespective of whether they belonged to a leaf vertex or not. Attempts belonging to leaf vertices can be left in the SUCCEEDED state.
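A minimal sketch of the rule described above, under stated assumptions: the types and method names here (AttemptState, Attempt, NodeFailureSketch, shouldKillOnNodeFailure) are hypothetical and are not Tez classes. The sketch only models the decision for attempts that already succeeded on the failed node. The rationale given in the code comment, that no downstream vertex consumes a leaf vertex's output, is an assumption and is not stated in the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model. None of these types exist in Tez.
enum AttemptState { RUNNING, SUCCEEDED, KILLED }

final class Attempt {
    final String id;
    final String node;
    final boolean onLeafVertex; // true if the owning vertex has no downstream vertex
    AttemptState state;

    Attempt(String id, String node, boolean onLeafVertex, AttemptState state) {
        this.id = id;
        this.node = node;
        this.onLeafVertex = onLeafVertex;
        this.state = state;
    }
}

final class NodeFailureSketch {
    // A succeeded attempt on the failed node is killed only if it does not
    // belong to a leaf vertex. Assumption: leaf output is not consumed by
    // another vertex, so the attempt can stay SUCCEEDED.
    static boolean shouldKillOnNodeFailure(Attempt a, String failedNode) {
        return a.node.equals(failedNode)
                && a.state == AttemptState.SUCCEEDED
                && !a.onLeafVertex;
    }

    // Applies the rule to every attempt and returns the ids that were killed.
    static List<String> onNodeFailed(String failedNode, List<Attempt> attempts) {
        List<String> killed = new ArrayList<>();
        for (Attempt a : attempts) {
            if (shouldKillOnNodeFailure(a, failedNode)) {
                a.state = AttemptState.KILLED;
                killed.add(a.id);
            }
        }
        return killed;
    }
}
```

With this rule, a map-only job such as teragen, whose single vertex is a leaf, would keep its succeeded attempts in SUCCEEDED after a node failure instead of rescheduling them.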
> 1TB teragen job failed due to 'TezUncheckedException: event.terminationCause is unexpected: INTERNAL_ERROR'
> -----------------------------------------------------------------------------------------------------------
>
> Key: TEZ-660
> URL: https://issues.apache.org/jira/browse/TEZ-660
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.3.0
> Reporter: Yesha Vora
> Assignee: Siddharth Seth
> Attachments: TEZ-660.txt
>
>
> Scenario:
> Only one teragen job (creating 1 TB of data) was running.
> During this test, the node where the AM was running became unreachable for an unknown reason and was declared unhealthy.
> ========
> 2013-12-03 02:56:55,525 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: AM_hostname. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)
> 2013-12-03 02:56:58,526 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: AM_hostname. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)
> 2013-12-03 02:56:59,452 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.rm.node.AMNodeImpl: AMNode AM_hostname transitioned from ACTIVE to UNHEALTHY
> =========
> The expected behavior is that the RM restarts the AM on another node, but no second AM attempt was launched.
> Instead, the 1 TB teragen job failed with the error below, and the final status of the job is Killed.
> =========
> 2013-12-03 02:56:59,587 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.TaskImpl: task_1385937217601_0002_1_00_001387 Task Transitioned from SUCCEEDED to SCHEDULED
> 2013-12-03 02:56:59,587 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1385937217601_0002_1 terminating due to internal error
> 2013-12-03 02:56:59,991 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1385937217601_0002_1 transitioned from RUNNING to ERROR
> 2013-12-03 02:56:59,999 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> org.apache.tez.dag.api.TezUncheckedException: VertexKilledTransition: event.terminationCause is unexpected: INTERNAL_ERROR
> at org.apache.tez.dag.app.dag.impl.VertexImpl$VertexKilledTransition.transition(VertexImpl.java:1683)
> at org.apache.tez.dag.app.dag.impl.VertexImpl$VertexKilledTransition.transition(VertexImpl.java:1671)
> at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:946)
> at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:142)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1252)
> at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1244)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
> at java.lang.Thread.run(Thread.java:662)
> 2013-12-03 02:57:00,001 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> 2013-12-03 02:57:00,037 INFO [Thread-2] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
> =========