Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2013/12/05 02:15:35 UTC
[jira] [Resolved] (TEZ-660) 1TB teragen job failed due to 'TezUncheckedException: event.terminationCause is unexpected: INTERNAL_ERROR'

     [ https://issues.apache.org/jira/browse/TEZ-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth resolved TEZ-660.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0

Committed to master.

> 1TB teragen job failed due to 'TezUncheckedException: event.terminationCause is unexpected: INTERNAL_ERROR'
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-660
>                 URL: https://issues.apache.org/jira/browse/TEZ-660
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Yesha Vora
>            Assignee: Siddharth Seth
>             Fix For: 0.3.0
>
>         Attachments: TEZ-660.txt
>
>
> Scenario:
> Only one teragen job (creating 1 TB of data) was running.
> During this test, the node where the AM was running became unreachable for an unknown reason and was declared unhealthy.
> ========
> 2013-12-03 02:56:55,525 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: AM_hostname. Already tried 10 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)
> 2013-12-03 02:56:58,526 INFO [ContainerLauncher #3] org.apache.hadoop.ipc.Client: Retrying connect to server: AM_hostname. Already tried 11 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1 SECONDS)
> 2013-12-03 02:56:59,452 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.rm.node.AMNodeImpl: AMNode AM_hostname transitioned from ACTIVE to UNHEALTHY
> =========
> The expected behavior is that the RM restarts the AM on another node, but I do not see a second AM attempt being launched.
> Instead, the 1TB teragen job failed with the error below, and the final status of the job is Killed.
> =========
> 2013-12-03 02:56:59,587 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.TaskImpl: task_1385937217601_0002_1_00_001387 Task Transitioned from SUCCEEDED to SCHEDULED
> 2013-12-03 02:56:59,587 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1385937217601_0002_1 terminating due to internal error
> 2013-12-03 02:56:59,991 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1385937217601_0002_1 transitioned from RUNNING to ERROR
> 2013-12-03 02:56:59,999 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> org.apache.tez.dag.api.TezUncheckedException: VertexKilledTransition: event.terminationCause is unexpected: INTERNAL_ERROR
>         at org.apache.tez.dag.app.dag.impl.VertexImpl$VertexKilledTransition.transition(VertexImpl.java:1683)
>         at org.apache.tez.dag.app.dag.impl.VertexImpl$VertexKilledTransition.transition(VertexImpl.java:1671)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:946)
>         at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:142)
>         at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1252)
>         at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1244)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:134)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:81)
>         at java.lang.Thread.run(Thread.java:662)
> 2013-12-03 02:57:00,001 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Exiting, bbye..
> 2013-12-03 02:57:00,037 INFO [Thread-2] org.apache.tez.dag.app.DAGAppMaster: DAGAppMaster received a signal. Signaling TaskScheduler
> =========
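
For readers tracing the failure: the stack trace above shows an exception thrown from VertexKilledTransition reaching the AsyncDispatcher thread, which then logs "Error in dispatcher thread" and exits. The Java sketch below illustrates that general pattern only. It is not the Tez source and not the committed patch; the class, enum, and method names are hypothetical, except INTERNAL_ERROR, which is taken from the log. The "tolerant" handler is just one possible way to avoid the crash, shown as an assumption.

```java
import java.util.Arrays;
import java.util.List;

public class TerminationCauseSketch {
    // Hypothetical causes; only INTERNAL_ERROR is named in the log above.
    enum TerminationCause { DAG_KILL, INTERNAL_ERROR }

    // Strict handler: throws on any cause it does not anticipate,
    // mirroring the "terminationCause is unexpected" message in the trace.
    static String strictTransition(TerminationCause cause) {
        switch (cause) {
            case DAG_KILL:
                return "KILLED";
            default:
                throw new IllegalStateException(
                    "event.terminationCause is unexpected: " + cause);
        }
    }

    // Tolerant handler (an assumption, not the actual fix): maps
    // INTERNAL_ERROR to a terminal state instead of throwing.
    static String tolerantTransition(TerminationCause cause) {
        switch (cause) {
            case DAG_KILL:
                return "KILLED";
            case INTERNAL_ERROR:
                return "ERROR";
            default:
                throw new IllegalStateException(
                    "event.terminationCause is unexpected: " + cause);
        }
    }

    // Simplified single-threaded dispatch loop: the first unchecked
    // exception ends the loop, so later events are never handled.
    // Returns the number of events handled before the loop stopped.
    static int dispatch(List<TerminationCause> events, boolean strict) {
        int handled = 0;
        try {
            for (TerminationCause cause : events) {
                if (strict) {
                    strictTransition(cause);
                } else {
                    tolerantTransition(cause);
                }
                handled++;
            }
        } catch (IllegalStateException e) {
            System.out.println("Error in dispatcher thread: " + e.getMessage());
        }
        return handled;
    }

    public static void main(String[] args) {
        List<TerminationCause> events = Arrays.asList(
            TerminationCause.DAG_KILL,
            TerminationCause.INTERNAL_ERROR,
            TerminationCause.DAG_KILL);
        System.out.println("strict handled:   " + dispatch(events, true));
        System.out.println("tolerant handled: " + dispatch(events, false));
    }
}
```

In this sketch the strict loop stops at the INTERNAL_ERROR event and never reaches the one queued after it, which is the kind of abrupt dispatcher exit the log shows.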



--
This message was sent by Atlassian JIRA
(v6.1#6144)