Posted to yarn-issues@hadoop.apache.org by "Sunil G (JIRA)" <ji...@apache.org> on 2015/06/25 18:24:04 UTC
[jira] [Commented] (YARN-3849) Too much of preemption activity causing continuos killing of containers across queues

    [ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601475#comment-14601475 ] 

Sunil G commented on YARN-3849:
-------------------------------

Looping in [~rohithsharma] and [~leftnoteasy].

Since we use the Dominant Resource Calculator, the following piece of code in ProportionalPreemptionPolicy looks doubtful:

{code}
      // When we have no more resource need to obtain, remove from map.
      if (Resources.lessThanOrEqual(rc, clusterResource, toObtainByPartition,
          Resources.none())) {
        resourceToObtainByPartitions.remove(nodePartition);
      }
{code}

Assume toObtainByPartition is <12, 1> (memory, cores). After another round of preemption, this becomes <10, 0>.
If the above check is hit with this value, it is supposed to return TRUE, but the method returns FALSE.

The reason is dominance: if any one resource component is non-zero, the dominant share is non-zero, so the resource is treated as greater than Resources.none().

{code}
// Just use 'dominant' resource
    return (dominant) ?
        Math.max(
            (float)resource.getMemory() / clusterResource.getMemory(), 
            (float)resource.getVirtualCores() / clusterResource.getVirtualCores()
            ) 
        :
          Math.min(
              (float)resource.getMemory() / clusterResource.getMemory(), 
              (float)resource.getVirtualCores() / clusterResource.getVirtualCores()
              ); 
{code}

If resource.getVirtualCores() is zero and resource.getMemory() is non-zero, this expression still returns a positive value.
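
To make this concrete, here is a minimal, self-contained Java sketch of the comparison on the values above. It is not the YARN code: Res is a hypothetical stand-in for Resource, the cluster size <100, 10> is an assumed value, and lessThanOrEqual here compares only the dominant share, which is a simplification of the real calculator.

```java
// Simplified stand-in for the dominant-share comparison; not the real YARN classes.
final class DominantShareDemo {

    // Hypothetical minimal resource vector: <memory, virtual cores>.
    static final class Res {
        final int memory;
        final int vcores;

        Res(int memory, int vcores) {
            this.memory = memory;
            this.vcores = vcores;
        }
    }

    // Dominant share: the larger of the two per-component fractions of the cluster.
    static float dominantShare(Res r, Res cluster) {
        return Math.max(
            (float) r.memory / cluster.memory,
            (float) r.vcores / cluster.vcores);
    }

    // Assumption: compare by dominant share only (the real calculator does more).
    static boolean lessThanOrEqual(Res cluster, Res lhs, Res rhs) {
        return dominantShare(lhs, cluster) <= dominantShare(rhs, cluster);
    }

    public static void main(String[] args) {
        Res cluster = new Res(100, 10); // assumed cluster size
        Res none = new Res(0, 0);       // plays the role of Resources.none()
        Res toObtain = new Res(10, 0);  // value after the second preemption round

        // Memory alone gives a dominant share of 0.1, so this prints "false".
        System.out.println(lessThanOrEqual(cluster, toObtain, none));
    }
}
```

With cores already at zero, the memory component alone keeps the dominant share above zero, so under these assumptions the partition entry is never removed from the map.
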
We feel this case has to be checked beforehand: if any one component is zero, we have to report lhs as less than rhs.
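
One way such a pre-check could look is sketched below. This is only an illustration of the idea, not a patch: Res, anyComponentZero, nothingLeftToObtain and the cluster size <100, 10> are hypothetical names and values introduced here, and the dominant-share fallback is the same simplification of the calculator as described above.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a zero-component pre-check before the dominant-share comparison.
final class ZeroComponentCheckSketch {

    // Hypothetical minimal resource vector: <memory, virtual cores>.
    static final class Res {
        final int memory;
        final int vcores;

        Res(int memory, int vcores) {
            this.memory = memory;
            this.vcores = vcores;
        }
    }

    // Hypothetical helper: true when either component has dropped to zero or below.
    static boolean anyComponentZero(Res r) {
        return r.memory <= 0 || r.vcores <= 0;
    }

    // Dominant share: the larger of the two per-component fractions of the cluster.
    static float dominantShare(Res r, Res cluster) {
        return Math.max(
            (float) r.memory / cluster.memory,
            (float) r.vcores / cluster.vcores);
    }

    // Pre-check first; fall back to the dominant-share test only if both
    // components are still positive.
    static boolean nothingLeftToObtain(Res cluster, Res toObtain) {
        if (anyComponentZero(toObtain)) {
            return true;
        }
        return dominantShare(toObtain, cluster) <= 0f;
    }

    public static void main(String[] args) {
        Res cluster = new Res(100, 10); // assumed cluster size
        String nodePartition = "";      // assumed partition label for the example

        Map<String, Res> resourceToObtainByPartitions = new HashMap<>();
        resourceToObtainByPartitions.put(nodePartition, new Res(10, 0));

        // When we have no more resource to obtain, remove from the map.
        if (nothingLeftToObtain(cluster, resourceToObtainByPartitions.get(nodePartition))) {
            resourceToObtainByPartitions.remove(nodePartition);
        }
        System.out.println(resourceToObtainByPartitions.isEmpty());
    }
}
```

The trade-off in this sketch is that a request with memory still outstanding but no cores left is treated as satisfied, which may stop preemption slightly early; whether that is acceptable is a policy decision.
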

> Too much of preemption activity causing continuos killing of containers across queues
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-3849
>                 URL: https://issues.apache.org/jira/browse/YARN-3849
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>            Priority: Critical
>
> Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, it has some demand, which invokes preemption in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is updated demand from the app in QueueA, which lost its containers earlier, so preemption now kicks in on QueueB.
> The scenarios in steps 3 and 4 keep repeating in a loop, so none of the apps complete.


