Posted to yarn-issues@hadoop.apache.org by "Andrew Wang (JIRA)" <ji...@apache.org> on 2017/12/13 21:13:00 UTC
[jira] [Updated] (YARN-7560) Resourcemanager hangs when
resourceUsedWithWeightToResourceRatio return a overflow value
[ https://issues.apache.org/jira/browse/YARN-7560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Wang updated YARN-7560:
------------------------------
Fix Version/s: 3.0.1 (was: 3.0.0)
> Resourcemanager hangs when resourceUsedWithWeightToResourceRatio return a overflow value
> ------------------------------------------------------------------------------------------
>
> Key: YARN-7560
> URL: https://issues.apache.org/jira/browse/YARN-7560
> Project: Hadoop YARN
> Issue Type: Bug
> Components: fairscheduler, resourcemanager
> Affects Versions: 3.0.0
> Reporter: zhengchenyu
> Assignee: zhengchenyu
> Fix For: 3.0.1
>
> Attachments: YARN-7560.000.patch, YARN-7560.001.patch
>
>
> In our cluster, we changed the configuration and ran refreshQueues, after which the ResourceManager hung and could not be restarted successfully. Every jstack dump we took showed the main thread in the same place:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f98e8017000 nid=0x2f5 runnable [0x00007f98eed9a000]
> java.lang.Thread.State: RUNNABLE
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:182)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSharesInternal(ComputeFairShares.java:140)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeSteadyShares(ComputeFairShares.java:66)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeSteadyShares(FairSharePolicy.java:148)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeSteadyShares(FSParentQueue.java:102)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getQueue(QueueManager.java:148)
> - locked <0x00007f8c4a8177a0> (a java.util.HashMap)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.getLeafQueue(QueueManager.java:101)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.QueueManager.updateAllocationConfiguration(QueueManager.java:387)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$AllocationReloadListener.onReload(FairScheduler.java:1728)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService.reloadAllocations(AllocationFileLoaderService.java:422)
> - locked <0x00007f8c4a7eb2e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AllocationFileLoaderService)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.initScheduler(FairScheduler.java:1597)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.serviceInit(FairScheduler.java:1621)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x00007f8c4a76ac48> (a java.lang.Object)
> at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:569)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x00007f8c49254268> (a java.lang.Object)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:997)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:257)
> at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> - locked <0x00007f8c467495e0> (a java.lang.Object)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1220)
> {code}
> While debugging the cluster, we found that resourceUsedWithWeightToResourceRatio returns a negative value, so the loop never exits. In our cluster the sum of all minRes values exceeds the maximum int value, which is why the result is negative.
> The loop is shown below. totalResource is a long and is always positive, but resourceUsedWithWeightToResourceRatio returns an int. Our cluster is large enough that the returned sum overflows and wraps to a negative number, so the loop condition stays true and the loop never breaks.
> {code}
> while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
>     < totalResource) {
>   rMax *= 2.0;
> }
> {code}
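The failure mode described above can be reproduced outside Hadoop with a small, self-contained sketch. This is not the ComputeFairShares code and not the attached patch; the class name, the helper names (share, usedAsInt, usedAsLong, doublings), the weight of 1.0 and the iteration cap are all assumptions made for illustration. It only shows that an int sum of large minimum shares wraps negative, so a doubling loop compared against a positive long never reaches its exit condition, while the same sum accumulated in a long does.

```java
class FairShareOverflowSketch {
    // Hypothetical per-queue share: weight * ratio, clamped to [minShare, maxShare].
    static int share(double weight, double ratio, int minShare, int maxShare) {
        double raw = weight * ratio;
        raw = Math.max(raw, minShare);
        raw = Math.min(raw, maxShare);
        return (int) raw;
    }

    // Accumulates in an int, so the total silently wraps past Integer.MAX_VALUE.
    static int usedAsInt(double ratio, int[] minShares, int maxShare) {
        int total = 0;
        for (int min : minShares) {
            total += share(1.0, ratio, min, maxShare);
        }
        return total;
    }

    // Same sum accumulated in a long, which cannot wrap for these inputs.
    static long usedAsLong(double ratio, int[] minShares, int maxShare) {
        long total = 0L;
        for (int min : minShares) {
            total += share(1.0, ratio, min, maxShare);
        }
        return total;
    }

    // Doubling loop in the shape of the one quoted above, with a cap so the
    // demo terminates. Returns the number of doublings, or -1 if the cap is hit.
    static int doublings(boolean useLong, int[] minShares, int maxShare,
                         long totalResource, int cap) {
        double rMax = 1.0;
        for (int i = 0; i < cap; i++) {
            long used = useLong
                ? usedAsLong(rMax, minShares, maxShare)
                : usedAsInt(rMax, minShares, maxShare);
            if (used >= totalResource) {
                return i;
            }
            rMax *= 2.0;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Two hypothetical queues whose minimum shares sum past Integer.MAX_VALUE.
        int[] mins = {1_500_000_000, 1_500_000_000};
        long totalResource = 2_000_000_000L;
        System.out.println("int sum:  " + usedAsInt(1.0, mins, Integer.MAX_VALUE));
        System.out.println("long sum: " + usedAsLong(1.0, mins, Integer.MAX_VALUE));
        System.out.println("int loop:  " + doublings(false, mins, Integer.MAX_VALUE, totalResource, 100));
        System.out.println("long loop: " + doublings(true, mins, Integer.MAX_VALUE, totalResource, 100));
    }
}
```

With these inputs the int-based loop hits the cap, because every clamped pair of shares still sums to a wrapped negative int, whereas the long-based loop exits immediately. Widening the accumulator is only one possible remedy; see the attached patches for the change actually proposed.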