Posted to yarn-issues@hadoop.apache.org by "Andrew Chung (Jira)" <ji...@apache.org> on 2021/05/19 20:43:00 UTC

[jira] [Updated] (YARN-10760) Number of allocated OPPORTUNISTIC containers can dip below 0

     [ https://issues.apache.org/jira/browse/YARN-10760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Chung updated YARN-10760:
--------------------------------
    Description: 
{{AbstractYarnScheduler.completedContainers}} can be called from multiple code paths, yet it appears that there are scenarios in which the caller does not hold the appropriate lock, which can lead to the count of {{OpportunisticSchedulerMetrics.AllocatedOContainers}} falling below 0.
To prevent double counting when releasing allocated OPPORTUNISTIC containers, a simple fix might be to check whether the {{RMContainer}} has already been removed before decrementing the metric, though that may not address the underlying race condition.
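For illustration, here is a self-contained sketch of the kind of guard described above: track which containers are still counted and decrement only when the release actually removes an entry. All class and method names below are made up for the example and are not the actual scheduler code.

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Standalone illustration of the proposed guard: the gauge is decremented
// only when a container is removed for the first time, so a duplicate
// release of the same container cannot push the count below 0.
public class OContainerCounter {
  private final ConcurrentHashMap<String, Boolean> liveOContainers =
      new ConcurrentHashMap<>();
  private final AtomicInteger allocatedOContainers = new AtomicInteger();

  public void onAllocated(String containerId) {
    // putIfAbsent() returns null only for the first registration.
    if (liveOContainers.putIfAbsent(containerId, Boolean.TRUE) == null) {
      allocatedOContainers.incrementAndGet();
    }
  }

  public void onReleased(String containerId) {
    // remove() returns null if the container was already released,
    // so the decrement happens at most once per container.
    if (liveOContainers.remove(containerId) != null) {
      allocatedOContainers.decrementAndGet();
    }
  }

  public int getAllocatedOContainers() {
    return allocatedOContainers.get();
  }
}
{code}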

The following is a capture, obtained via a JMX query, of {{OpportunisticSchedulerMetrics}} with {{AllocatedOContainers}} below 0 (a sketch of such a query follows the capture):

{noformat}
{
    "name" : "Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics",
    "modelerType" : "OpportunisticSchedulerMetrics",
    "tag.OpportunisticSchedulerMetrics" : "ResourceManager",
    "tag.Context" : "yarn",
    "tag.Hostname" : "",
    "AllocatedOContainers" : -2716,
    "AggregateOContainersAllocated" : 306020,
    "AggregateOContainersReleased" : 308736,
    "AggregateNodeLocalOContainersAllocated" : 0,
    "AggregateRackLocalOContainersAllocated" : 0,
    "AggregateOffSwitchOContainersAllocated" : 306020,
    "AllocateLatencyOQuantilesNumOps" : 0,
    "AllocateLatencyOQuantiles50thPercentileTime" : 0,
    "AllocateLatencyOQuantiles75thPercentileTime" : 0,
    "AllocateLatencyOQuantiles90thPercentileTime" : 0,
    "AllocateLatencyOQuantiles95thPercentileTime" : 0,
    "AllocateLatencyOQuantiles99thPercentileTime" : 0
  }
{noformat}
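For reference, a capture like the one above can be obtained from the ResourceManager's JMX JSON servlet ({{/jmx}} on the RM web UI port, 8088 by default). The host below is a placeholder, and the port may differ in your cluster; this is only a sketch of one way to issue the query.

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Queries the OpportunisticSchedulerMetrics bean from the RM's JMX JSON
// servlet. Pass the RM host as the first argument; 8088 is the default
// RM web UI port.
public class FetchOContainerMetrics {
  public static void main(String[] args) throws Exception {
    String rmHost = args.length > 0 ? args[0] : "localhost";
    String url = "http://" + rmHost + ":8088/jmx"
        + "?qry=Hadoop:service=ResourceManager,name=OpportunisticSchedulerMetrics";
    HttpResponse<String> response = HttpClient.newHttpClient().send(
        HttpRequest.newBuilder(URI.create(url)).GET().build(),
        HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body());  // JSON like the capture above
  }
}
{code}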

UPDATE: Upon further investigation, the culprit appears to be that {{AllocatedOContainers}} is not incremented for containers recovered when the RM restarts: deallocation still decrements the metric for the recovered OPPORTUNISTIC containers, but the corresponding increment never happens during recovery. We have an initial fix for this and are waiting to verify it.
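A minimal sketch of the shape of that fix, reusing the illustrative {{OContainerCounter}} from the first sketch; the names are again made up, and the real change would live in the RM's container-recovery path:

{code:java}
import java.util.List;

// Illustrative only: on RM restart, re-register each recovered OPPORTUNISTIC
// container so the eventual release has a matching increment to cancel out.
public class OContainerRecovery {
  private final OContainerCounter counter;

  public OContainerRecovery(OContainerCounter counter) {
    this.counter = counter;
  }

  public void recoverContainers(List<String> recoveredOContainerIds) {
    for (String containerId : recoveredOContainerIds) {
      // onAllocated() is idempotent in the sketch above, so re-registering
      // an already-tracked container does not inflate the gauge.
      counter.onAllocated(containerId);
    }
  }
}
{code}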

> Number of allocated OPPORTUNISTIC containers can dip below 0
> ------------------------------------------------------------
>
>                 Key: YARN-10760
>                 URL: https://issues.apache.org/jira/browse/YARN-10760
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Andrew Chung
>            Assignee: Andrew Chung
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org