You are viewing a plain text version of this content. The canonical link for it is here.
Posted to hdfs-user@hadoop.apache.org by Benjamin Bowman <bb...@gmail.com> on 2014/06/19 15:47:12 UTC

Hadoop 2.4 Fair Scheduler goes missing

All,

I recently upgraded to Hadoop 2.4 and I am seeing a problem with the
Resource Manager's fair scheduler.  After a couple days of full time
operation where several MR jobs are submitted per minute, the fair
scheduler will suddenly stop scheduling jobs.  The only way I have found to
remedy this is to restart the Resource Manager   I have full logs with
debug output when this occurs if you want them.   Below I have included two
sections from the log for two jobs (it is not the complete log because I
just grepped through it for the application IDs).  The first section is the
end of a successful job.  The second is also the end of a successful job
but it is the last job that ran before no more jobs were scheduled.  Notice
in the second section that we do not get "Application
appattempt_1403035251873_8570_000001 is done. finalState=FINISHED".  In
fact, we never hear from resourcemanager.fair.FairScheduler from this point
on.  Has anyone else experienced a similar issue?

Thanks,
Benjamin



CORRECT JOB COMPLETION:  (the lines in bold are the ones that are missing
from the next section)

2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403035251873_8570_01_000001 Container Transitioned from RUNNING
to COMPLETED
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp:
Completed container: container_1403035251873_8570_01_000001 in state:
COMPLETED event:FINISHED
2014-06-19 03:40:23,582 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Processing event for appattempt_1403035251873_8570_000001 of type
CONTAINER_FINISHED
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS
 APPID=application_1403035251873_8570
 CONTAINERID=container_1403035251873_8570_01_000001
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1403035251873_8570_000001
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
Released container container_1403035251873_8570_01_000001 of capacity
<memory:3072, vCores:1> on host majestic:40332, which currently has 0
containers, <memory:0, vCores:0> used and <memory:103196, vCores:19>
available, release resources=true
2014-06-19 03:40:23,582 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
Application finished, removing password for
appattempt_1403035251873_8570_000001
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application attempt appattempt_1403035251873_8570_000001 released container
container_1403035251873_8570_01_000001 on node: host: master:40332
#containers=0 available=<memory:103196, vCores:19> used=<memory:0,
vCores:0> with event: FINISHED
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1403035251873_8570_000001 State change from FINISHING to FINISHED
2014-06-19 03:40:23,582 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing
event for application_1403035251873_8570 of type ATTEMPT_FINISHED
2014-06-19 03:40:23,582 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable:
Node offered to app: application_1403035251873_8570 reserved: false
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1403035251873_8570 State change from FINISHING to FINISHED
*2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application appattempt_1403035251873_8570_000001 is done.
finalState=FINISHED*
*2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo:
Application application_1403035251873_8570 requests cleared*
2014-06-19 03:40:23,582 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Cleaning master appattempt_1403035251873_8570_000001
2014-06-19 03:40:23,582 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RMAppManager
processing event for application_1403035251873_8570 of type APP_COMPLETED
2014-06-19 03:40:23,583 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
OPERATION=Application Finished - Succeeded      TARGET=RMAppManager
RESULT=SUCCESS  APPID=application_1403035251873_8570
2014-06-19 03:40:23,583 DEBUG
org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager: creating
password for appattempt_1403035251873_8570_000001 for user root to run on
NM master:40332
2014-06-19 03:40:23,583 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary:
appId=application_1403035251873_8570,name=insert into table dns
partit...loading_dns_2(Stage
1),user=root,queue=root.orc_queue,state=FINISHED,trackingUrl=
http://master:8088/proxy/application_1403035251873_8570/jobhistory/job/job_1403035251873_8570,appMasterHost=majestic,startTime=1403149195741,finishTime=1403149216331,finalStatus=SUCCEEDED
2014-06-19 03:40:23,583 DEBUG
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction
as:appattempt_1403035251873_8570_000001 (auth:SIMPLE)
from:org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.getContainerMgrProxy(AMLauncher.java:171)
2014-06-19 03:40:23,584 DEBUG
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction
as:appattempt_1403035251873_8570_000001 (auth:SIMPLE)
from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:711)
2014-06-19 03:40:23,589 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to master/<master_IP>:40332 from
appattempt_1403035251873_8570_000001: starting, having connections 1
2014-06-19 03:40:23,591 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to majestic/<master_IP>:40332 from
appattempt_1403035251873_8570_000001 sending #17371
2014-06-19 03:40:23,593 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to majestic/<master_IP>:40332 from
appattempt_1403035251873_8570_000001 got value #17371
2014-06-19 03:40:33,592 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to majestic/<master_IP>:40332 from
appattempt_1403035251873_8570_000001: closed
2014-06-19 03:40:33,592 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to majestic/<master_IP>:40332 from
appattempt_1403035251873_8570_000001: stopped, remaining connections 0

INCORRECT JOB COMPLETION (resourcemanager.fair.FairScheduler never
acknowledges finished job) - No more jobs get scheduled after this and I
never see resourcemanager.fair.FairScheduler again

2014-06-19 03:42:07,512 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1403035251873_8573_01_000001 Container Transitioned from RUNNING
to COMPLETED
2014-06-19 03:42:07,512 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp:
Completed container: container_1403035251873_8573_01_000001 in state:
COMPLETED event:FINISHED
2014-06-19 03:42:07,512 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
OPERATION=AM Released Container TARGET=SchedulerApp     RESULT=SUCCESS
 APPID=application_1403035251873_8573
 CONTAINERID=container_1403035251873_8573_01_000001
2014-06-19 03:42:07,512 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
Released container container_1403035251873_8573_01_000001 of capacity
<memory:3072, vCores:1> on host node:47221, which currently has 0
containers, <memory:0, vCores:0> used and <memory:24576, vCores:8>
available, release resources=true
2014-06-19 03:42:07,512 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application attempt appattempt_1403035251873_8573_000001 released container
container_1403035251873_8573_01_000001 on node: host: node:47221
#containers=0 available=<memory:24576, vCores:8> used=<memory:0, vCores:0>
with event: FINISHED
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable:
Node offered to app: application_1403035251873_8573 reserved: false
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Processing event for appattempt_1403035251873_8573_000001 of type
CONTAINER_FINISHED
2014-06-19 03:42:07,513 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1403035251873_8573_000001
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.security.AMRMTokenSecretManager:
Application finished, removing password for
appattempt_1403035251873_8573_000001
2014-06-19 03:42:07,513 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1403035251873_8573_000001 State change from FINISHING to FINISHED
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing
event for application_1403035251873_8573 of type ATTEMPT_FINISHED
2014-06-19 03:42:07,513 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1403035251873_8573 State change from FINISHING to FINISHED
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue:
Counting resource from application_1403035251873_8573 <memory:0, vCores:0>;
Total resource consumption for root.orc_queue now <memory:0, vCores:0>
2014-06-19 03:42:07,513 DEBUG
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: RMAppManager
processing event for application_1403035251873_8573 of type APP_COMPLETED
2014-06-19 03:42:07,513 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher:
Cleaning master appattempt_1403035251873_8573_000001
2014-06-19 03:42:07,513 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root
OPERATION=Application Finished - Succeeded      TARGET=RMAppManager
RESULT=SUCCESS  APPID=application_1403035251873_8573
2014-06-19 03:42:07,522 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary:
appId=application_1403035251873_8573,name=insert into table dns
partit...loading_dns_2(Stage-1),user=root,queue=root.orc_queue,state=FINISHED,trackingUrl=
http://master:8088/proxy/application_1403035251873_8573/jobhistory/job/job_1403035251873_8573,appMasterHost=node,startTime=1403149302300,finishTime=1403149320694,finalStatus=SUCCEEDED
2014-06-19 03:42:07,522 DEBUG
org.apache.hadoop.yarn.server.security.BaseNMTokenSecretManager: creating
password for appattempt_1403035251873_8573_000001 for user root to run on
NM node:47221
2014-06-19 03:42:07,523 DEBUG
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction
as:appattempt_1403035251873_8573_000001 (auth:SIMPLE)
from:org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.getContainerMgrProxy(AMLauncher.java:171)
2014-06-19 03:42:07,524 DEBUG
org.apache.hadoop.security.UserGroupInformation: PrivilegedAction
as:appattempt_1403035251873_8573_000001 (auth:SIMPLE)
from:org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:711)
2014-06-19 03:42:07,529 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to node/<node_IP>:47221 from
appattempt_1403035251873_8573_000001: starting, having connections 1
2014-06-19 03:42:07,532 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to node/<node_IP>:47221 from
appattempt_1403035251873_8573_000001 sending #17378
2014-06-19 03:42:07,536 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to node/<node_IP>:47221 from
appattempt_1403035251873_8573_000001 got value #17378
2014-06-19 03:42:17,533 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to node/<node_IP>:47221 from
appattempt_1403035251873_8573_000001: closed
2014-06-19 03:42:17,533 DEBUG org.apache.hadoop.ipc.Client: IPC Client
(1588438933) connection to node/<node_IP>:47221 from
appattempt_1403035251873_8573_000001: stopped, remaining connections 0