You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Feng Yuan (JIRA)" <ji...@apache.org> on 2016/05/24 07:45:13 UTC
[jira] [Updated] (TEZ-3266) DAG failed when yarn resources is rare
like " No groups available for user A" because DAGAppMaster launched and
exit_with_sucessful immediately.
[ https://issues.apache.org/jira/browse/TEZ-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Feng Yuan updated TEZ-3266:
---------------------------
Description:
When in a resource queue there is full of apps,if you submit a new tez app,there is infomations like :
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1463493135662_66844_01_000004 Container Transitioned from ALLOCATED to ACQUIRED
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 66847 submitted by user bae
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1463493135662_66847
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae IP=192.168.44.40 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1463493135662_66847
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW to NEW_SAVING
2016-05-24 01:52:02,964 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1463493135662_66847
2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW_SAVING to SUBMITTED
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1463493135662_66847 from user: bae, in queue: default, currently num of applications: 16
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from SUBMITTED to ACCEPTED
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1463493135662_66847_000001
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from NEW to SUBMITTED
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1463493135662_66847_000001 to scheduler from user: bae
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from SUBMITTED to SCHEDULED
2016-05-24 01:52:02,976 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1463493135662_66837_000001 (auth:SIMPLE)
2016-05-24 01:52:03,044 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
and then a container assigned to AM but am cant running,it's log:
2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1463493135662_66846_01_000004 succeeded
2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1463493135662_66846_01_000004 transitioned from RUNNING to EXITED_WITH_SUCCESS
you can see it can launch but why exit right away?
and because of AM is exit,you will see these log in hive.log:
2.2016-05-24 01:52:13,921 ERROR [Thread-18]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:603)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:234)
at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
3.2016-05-24 01:52:14,597 INFO [main]: client.TezClient (TezClient.java:stop(444)) - Failed to shutdown Tez Session via proxy
org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1463493135662_66847, yarnApplicationState=FAILED, finalApplicationStatus=FAILED, trackingUrl=http://bjlg-44p12-rm01:8088/cluster/app/application_1463493135662_66847
at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:787)
at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:623)
at org.apache.tez.client.TezClient.stop(TezClient.java:436)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.close(TezSessionState.java:269)
at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.close(TezSessionPoolManager.java:176)
at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1256)
at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:108)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:683)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
2016-05-24 01:52:14,597 INFO [main]: client.TezClient (TezClient.java:stop(449)) - Could not connect to AM, killing session via YARN, sessionName=XXX, applicationId=application_1463493135662_66847
was:
When in a resource queue there is full of apps,if you submit a new tez app,there is infomations like :
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1463493135662_66844_01_000004 Container Transitioned from ALLOCATED to ACQUIRED
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 66847 submitted by user bae
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1463493135662_66847
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae IP=192.168.44.40 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1463493135662_66847
2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW to NEW_SAVING
2016-05-24 01:52:02,964 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1463493135662_66847
2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW_SAVING to SUBMITTED
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1463493135662_66847 from user: bae, in queue: default, currently num of applications: 16
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from SUBMITTED to ACCEPTED
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1463493135662_66847_000001
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from NEW to SUBMITTED
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1463493135662_66847_000001 to scheduler from user: bae
2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from SUBMITTED to SCHEDULED
2016-05-24 01:52:02,976 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1463493135662_66837_000001 (auth:SIMPLE)
2016-05-24 01:52:03,044 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> DAG failed when yarn resources is rare like " No groups available for user A" because DAGAppMaster launched and exit_with_sucessful immediately.
> -------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: TEZ-3266
> URL: https://issues.apache.org/jira/browse/TEZ-3266
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.5.2
> Environment: hadoop-2.6.0, hive-0.14.0
> Reporter: Feng Yuan
>
> When in a resource queue there is full of apps,if you submit a new tez app,there is infomations like :
> org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1463493135662_66844_01_000004 Container Transitioned from ALLOCATED to ACQUIRED
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 66847 submitted by user bae
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1463493135662_66847
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=bae IP=192.168.44.40 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1463493135662_66847
> 2016-05-24 01:52:02,963 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW to NEW_SAVING
> 2016-05-24 01:52:02,964 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1463493135662_66847
> 2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from NEW_SAVING to SUBMITTED
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user bae
> 2016-05-24 01:52:02,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Accepted application application_1463493135662_66847 from user: bae, in queue: default, currently num of applications: 16
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1463493135662_66847 State change from SUBMITTED to ACCEPTED
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1463493135662_66847_000001
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from NEW to SUBMITTED
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1463493135662_66847_000001 to scheduler from user: bae
> 2016-05-24 01:52:02,967 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1463493135662_66847_000001 State change from SUBMITTED to SCHEDULED
> 2016-05-24 01:52:02,976 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1463493135662_66837_000001 (auth:SIMPLE)
> 2016-05-24 01:52:03,044 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> and then a container assigned to AM but am cant running,it's log:
> 2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1463493135662_66846_01_000004 succeeded
> 2016-05-24 01:51:57,610 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1463493135662_66846_01_000004 transitioned from RUNNING to EXITED_WITH_SUCCESS
> you can see it can launch but why exit right away?
> and because of AM is exit,you will see these log in hive.log:
> 2.2016-05-24 01:52:13,921 ERROR [Thread-18]: exec.Task (TezTask.java:execute(184)) - Failed to execute tez graph.
> org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown
> at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:603)
> at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)
> at org.apache.hadoop.hive.ql.exec.tez.TezTask.updateSession(TezTask.java:234)
> at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:136)
> at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:160)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
> at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:72)
> 3.2016-05-24 01:52:14,597 INFO [main]: client.TezClient (TezClient.java:stop(444)) - Failed to shutdown Tez Session via proxy
> org.apache.tez.dag.api.SessionNotRunning: Application not running, applicationId=application_1463493135662_66847, yarnApplicationState=FAILED, finalApplicationStatus=FAILED, trackingUrl=http://bjlg-44p12-rm01:8088/cluster/app/application_1463493135662_66847
> at org.apache.tez.client.TezClientUtils.getSessionAMProxy(TezClientUtils.java:787)
> at org.apache.tez.client.TezClient.getSessionAMProxy(TezClient.java:623)
> at org.apache.tez.client.TezClient.stop(TezClient.java:436)
> at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.close(TezSessionState.java:269)
> at org.apache.hadoop.hive.ql.exec.tez.TezSessionPoolManager.close(TezSessionPoolManager.java:176)
> at org.apache.hadoop.hive.ql.session.SessionState.close(SessionState.java:1256)
> at org.apache.hadoop.hive.cli.CliSessionState.close(CliSessionState.java:108)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:683)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> 2016-05-24 01:52:14,597 INFO [main]: client.TezClient (TezClient.java:stop(449)) - Could not connect to AM, killing session via YARN, sessionName=XXX, applicationId=application_1463493135662_66847
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)