You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2016/05/24 15:08:12 UTC

[jira] [Comment Edited] (TEZ-3266) DAG failed when yarn resources is rare like " No groups available for user A" because DAGAppMaster launched and exit_with_sucessful immediately.

    [ https://issues.apache.org/jira/browse/TEZ-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15298326#comment-15298326 ] 

Hitesh Shah edited comment on TEZ-3266 at 5/24/16 3:07 PM:
-----------------------------------------------------------

Yes I think it might be related to "tez.session.client.timeout.secs. The lack of logs implies that the AM never came up. Killed by user implies that the Hive CLI ended up killing the app. The Hive CLI/Tez Client waits for the configured value of "tez.session.client.timeout.secs ( default 120 secs ) before giving up. You can set this value to -1 to wait forever to resolve your issue.  


was (Author: hitesh):
Actually I think it might be related to tez.am.dag.submit.timeout.secs. The lack of logs implies that the AM never came up. Killed by user implies that the Hive CLI ended up killing the app. The Hive CLI/Tez Client waits for the configured value of tez.am.dag.submit.timeout.secs ( default 300 secs ) before giving up. You can set this value to -1 to wait forever to resolve your issue.  

> DAG failed when yarn resources is rare like " No groups available for user A" because DAGAppMaster launched and exit_with_sucessful  immediately.
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-3266
>                 URL: https://issues.apache.org/jira/browse/TEZ-3266
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.2
>         Environment: hadoop-2.6.0, hive-0.14.0
>            Reporter: Feng Yuan
>
> When in a resource queue there is full of apps,if you submit a new tez app,there is infomations like :
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1463493135662_66844_01_000004 Container Transitioned from ALLOCATED to ACQUIRED
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 66847 submitted by user bae
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1463493135662_66847
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae      IP=192.168.44.40        OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1463493135662_66847
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW to NEW_SAVING
> 2016-05-24 01:52:02,964 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1463493135662_66847
> 2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW_SAVING to SUBMITTED
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1463493135662_66847 from user: bae, in queue: default, currently num of applications: 16
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from SUBMITTED to ACCEPTED
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1463493135662_66847_000001
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from NEW to SUBMITTED
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1463493135662_66847_000001 to scheduler from user: bae
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from SUBMITTED to SCHEDULED
> 2016-05-24 01:52:02,976 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1463493135662_66837_000001 (auth:SIMPLE)
> 2016-05-24 01:52:03,044 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> and then a container assigned to AM but am cant running,it's log:
> 2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1463493135662_66846_01_000004 succeeded
> 2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1463493135662_66846_01_000004 transitioned from RUNNING to EXITED_WITH_SUCCESS
> you can see it can launch but why exit right away?
> and because of AM is exit,you will see these log in hive.log:
> 2.2016-05-24 01:52:13,921 ERROR [Thread-18]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
> org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
>         at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:603)
>         at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
>         at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:234)
>         at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:136)
>         at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
>         at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
> 3.2016-05-24 01:52:14,597 INFO  [main]: client.TezClient (TezClient.java:stop(444)) - Failed to shutdown Tez Session via proxy
> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1463493135662_66847, yarnApplicationState=FAILED, finalApplicationStatus=FAILED, trackingUrl=http://rm01:8088/cluster/app/application_1463493135662_66847
>         at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:787)
>         at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:623)
>         at org.apache.tez.client.TezClient.stop(TezClient.java:436)
>         at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.close(TezSessionState.java:269)
>         at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.close(TezSessionPoolManager.java:176)
>         at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1256)
>         at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:108)
>         at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:683)
>         at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>         at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> 2016-05-24 01:52:14,597 INFO  [main]: client.TezClient (TezClient.java:stop(449)) - Could not connect to AM, killing session via YARN, sessionName=XXX, applicationId=application_1463493135662_66847



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)