Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/09/11 02:18:33 UTC

[jira] [Commented] (TEZ-1357) Display better diagnostics when AM fails to launch

    [ https://issues.apache.org/jira/browse/TEZ-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14129423#comment-14129423 ] 

Rajesh Balamohan commented on TEZ-1357:
---------------------------------------

Adding a couple of scenarios that I encountered in a local VM.

*From user/client side*:
>>>>>>
14/09/01 16:59:55 INFO client.TezClient: Tez system stage directory hdfs://tez-vm:56565/tmp/root/staging/.tez/application_1409615791617_0003 doesn't exist and is created
14/09/01 16:59:55 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1409615791617_0003, dagName=Tez_978
14/09/01 16:59:55 INFO impl.YarnClientImpl: Submitted application application_1409615791617_0003
14/09/01 16:59:55 INFO client.TezClient: The url to track the Tez AM: http://tez-vm:8088/proxy/application_1409615791617_0003/
14/09/01 16:59:55 INFO client.RMProxy: Connecting to ResourceManager at tez-vm/127.0.1.1:8032
14/09/01 16:59:55 INFO rpc.DAGClientRPCImpl: Waiting for DAG to start running
14/09/01 16:59:59 INFO rpc.DAGClientRPCImpl: DAG completed. FinalState=FAILED
>>>>>>

*Real cause (from node manager logs)*:
It would be helpful to tell the user why the AM launch failed:
>>>>>>>>>
org.apache.hadoop.util.Shell$ExitCodeException: /grid/0/tmp/nm-local/usercache/root/appcache/application_1409615791617_0001/container_1409615791617_0001_02_000001/launch_container.sh: line 96: $JAVA_HOME/bin/java  -Xmx240m ${mapreduce.map.java.opts} -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/ -Dlog4j.configuration=tez-container-log4j.properties -Dyarn.app.container.log.dir=/grid/0/tmp/nm-logs/application_1409615791617_0001/container_1409615791617_0001_02_000001 -Dtez.root.logger=INFO,CLA -Dsun.nio.ch.bugLevel='' org.apache.tez.dag.app.DAGAppMaster 1>/grid/0/tmp/nm-logs/application_1409615791617_0001/container_1409615791617_0001_02_000001/stdout 2>/grid/0/tmp/nm-logs/application_1409615791617_0001/container_1409615791617_0001_02_000001/stderr : bad substitution

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
        at org.apache.hadoop.util.Shell.run(Shell.java:418)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
>>>>>>>>>
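
The "bad substitution" above most likely comes from the unresolved ${mapreduce.map.java.opts} placeholder: bash does not accept dots in a variable name, so the launch script fails before the JVM starts. A minimal sketch of a client-side pre-check that could flag such placeholders before submission follows. The class and method names (LaunchOptsCheck, findUnresolvedPlaceholders) are hypothetical and not part of Tez or YARN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not a Tez or YARN API. */
final class LaunchOptsCheck {
    // Matches ${a.b.c}-style placeholders, i.e. names containing at least one dot.
    // Plain shell variables such as $JAVA_HOME or ${JAVA_HOME} are not matched.
    private static final Pattern DOTTED_PLACEHOLDER =
        Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*(?:\\.[A-Za-z0-9_]+)+)\\}");

    private LaunchOptsCheck() {}

    /** Returns the names of all property-style placeholders left in the command. */
    static List<String> findUnresolvedPlaceholders(String launchCommand) {
        List<String> names = new ArrayList<>();
        Matcher m = DOTTED_PLACEHOLDER.matcher(launchCommand);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

Running this against the java opts string before the DAG is submitted would let the client report the offending property name directly, instead of only FinalState=FAILED.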

Another case:
==========
A user accidentally sets the vertex parallelism to “0” instead of “-1”. (I agree that the javadoc clearly says to set it to -1, but this was done by mistake.)

*From user/client side*:
>>>>>>
14/09/01 17:05:33 INFO client.RMProxy: Connecting to ResourceManager at tez-vm/127.0.1.1:8032
14/09/01 17:05:33 INFO rpc.DAGClientRPCImpl: Waiting for DAG to start running
14/09/01 17:05:36 INFO rpc.DAGClientRPCImpl: DAG initialized: CurrentState=Running
14/09/01 17:05:37 INFO ipc.Client: Retrying connect to server: tez-vm/127.0.1.1:51916. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/09/01 17:05:38 INFO ipc.Client: Retrying connect to server: tez-vm/127.0.1.1:51916. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
14/09/01 17:05:39 INFO ipc.Client: Retrying connect to server: tez-vm/127.0.1.1:51916. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>>>>>>


*Real cause*:
Can this exception/message be shown on the client side, so that debugging becomes easier?
>>>>>>>>>
java.lang.IllegalStateException: Parallelism for the vertex should be set to -1 if the InputInitializer is setting parallelism, VertexName: M_1
        at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
        at org.apache.tez.dag.app.dag.impl.RootInputVertexManager.onRootVertexInitialized(RootInputVertexManager.java:80)
        at org.apache.tez.dag.app.dag.impl.VertexManager.onRootVertexInitialized(VertexManager.java:260)
        at org.apache.tez.dag.app.dag.impl.VertexImpl$RootInputInitializedTransition.transition(VertexImpl.java:2726)
        at org.apache.tez.dag.app.dag.impl.VertexImpl$RootInputInitializedTransition.transition(VertexImpl.java:2718)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1337)
>>>>>>>>>
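
One way to surface this earlier would be a check on the client before the DAG is submitted. The sketch below is a simplified illustration under stated assumptions: ParallelismCheck and validate are hypothetical names, the boolean flag stands in for "this vertex has an InputInitializer that sets parallelism", and only the case from the exception above is covered.

```java
import java.util.Optional;

/** Hypothetical client-side check for illustration only; not a Tez API. */
final class ParallelismCheck {
    private ParallelismCheck() {}

    /**
     * Returns a diagnostic message if the vertex parallelism conflicts with an
     * initializer that sets parallelism itself, or an empty Optional otherwise.
     */
    static Optional<String> validate(String vertexName, int parallelism,
                                     boolean initializerSetsParallelism) {
        if (initializerSetsParallelism && parallelism != -1) {
            return Optional.of("Parallelism for the vertex should be set to -1 if the "
                + "InputInitializer is setting parallelism, VertexName: " + vertexName
                + ", configured parallelism: " + parallelism);
        }
        return Optional.empty();
    }
}
```

With a check like this, the "0" case would produce a message on the client right away rather than a series of connection retries against an AM that has already gone away.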

> Display better diagnostics when AM fails to launch
> --------------------------------------------------
>
>                 Key: TEZ-1357
>                 URL: https://issues.apache.org/jira/browse/TEZ-1357
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>



