You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "qingwu.fu (JIRA)" <ji...@apache.org> on 2013/11/29 07:35:36 UTC

[jira] [Commented] (YARN-1458) hadoop2.2.0 fairscheduler ResourceManager Event Processor thread blocked

    [ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13835202#comment-13835202 ] 

qingwu.fu commented on YARN-1458:
---------------------------------

hi all,
     Here's other phenomena :
     1.       If some one submit job, the resourcemanager accepts it, but the job doesn’t run. In the meantime, the resourcemanager print a lot logs like “2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001”, and the fairscheduler doesn’t print hearbeat log to ${HADOOP_HOME}/logs/fairscheduler/hadoop-{user}-fairscheduler.log
     2.       The fairscheduler ui can’t be opened and response 500 error.

And here's resourcemanager log when this error appearing:
 Here's  normal logs:
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1120 submitted by user root
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     IP=192.168.24.101       OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW to NEW_SAVING
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1120
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1120_000001
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from NEW to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1384743376038_1120_000001, user: root, currently active: 2
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from SUBMITTED to SCHEDULED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from SUBMITTED to ACCEPTED
2013-11-27 14:25:36,816 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: Node offered to app: application_1384743376038_1120 reserved: false
 
 
Abnormal logs:  these logs doesn’t contain the log like :
 2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1384743376038_1120_000001, user: root, currently active: 2
 
Here is abnormal logs:
2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1122
2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1121 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1122 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu      IP=192.168.24.101 OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu      IP=192.168.24.101
       OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1122_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1122_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1121_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1121_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1122_000001
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001



> hadoop2.2.0 fairscheduler ResourceManager Event Processor thread blocked
> ------------------------------------------------------------------------
>
>                 Key: YARN-1458
>                 URL: https://issues.apache.org/jira/browse/YARN-1458
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.2.0
>         Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>            Reporter: qingwu.fu
>              Labels: patch
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The output of  jstack command on resourcemanager pid:
>  "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 waiting for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>         - waiting to lock <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 runnable [0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>         at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.1#6144)