Posted to yarn-issues@hadoop.apache.org by "Peter D Kirchner (JIRA)" <ji...@apache.org> on 2015/02/10 22:49:12 UTC

[jira] [Commented] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers

    [ https://issues.apache.org/jira/browse/YARN-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315013#comment-14315013 ] 

Peter D Kirchner commented on YARN-3020:
----------------------------------------

Hi Wei Yan,
My point, adjusted to take the "expected usage" into account, is that when matching requests and/or allocations are spread over multiple heartbeats, too many containers are requested and received.

So, suppose my application calls addContainerRequest() 10 times.

Let's take your example where the AMRMClient sends a request for 1 container on heartbeat 1, and a request for 10 at heartbeat 2, overwriting the 1.
Say also that the second RPC returns with 1 container.

The second request (for 10) is high by one, because the application does not yet know about the incoming allocation.
Subsequent updates are also high by approximately the number of incoming containers.
My application heartbeat is 1 second and the RM typically allocates 1 container/node/second, so I'd expect the 10 containers to come in on the third heartbeat.
Per expected usage, my AMRMClient would have sent out an updated request for 9 containers at that time.
My application would zero out the matching request on the fourth heartbeat and release the nine extra containers (90% more) that it received but never intended to request.
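
For clarity, here is a minimal sketch of the "expected usage" pattern I am assuming above, written against the synchronous Hadoop 2.x AMRMClient API (the container size, the 1-second sleep, and the class name are illustrative only, not taken from this issue):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class ExpectedUsageSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> client = AMRMClient.createAMRMClient();
    client.init(new Configuration());
    client.start();
    client.registerApplicationMaster("", 0, "");

    Priority priority = Priority.newInstance(0);
    Resource capability = Resource.newInstance(1024, 1);  // illustrative size

    // The application wants 10 identical containers at a single priority.
    for (int i = 0; i < 10; i++) {
      client.addContainerRequest(
          new ContainerRequest(capability, null, null, priority));
    }

    int outstanding = 10;
    while (outstanding > 0) {
      AllocateResponse response = client.allocate(0.0f);  // one heartbeat RPC
      for (Container c : response.getAllocatedContainers()) {
        if (outstanding > 0) {
          // Expected usage: retire the matching request as each container arrives.
          client.removeContainerRequest(
              new ContainerRequest(capability, null, null, priority));
          outstanding--;
        } else {
          // Surplus container the application never intended to request.
          client.releaseAssignedContainer(c.getId());
        }
      }
      Thread.sleep(1000);  // 1-second application heartbeat, as in the scenario
    }

    client.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    client.stop();
  }
}
{code}

Note that this sketch stops heartbeating once the outstanding count reaches zero; in the scenario above, the nine surplus containers arrive on a later heartbeat, so a real application would have to keep calling allocate() and releasing them.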

In the present implementation, with the AMRMClient keeping track of the totals, removeContainerRequest() properly decrements AMRMClient's idea of the outstanding count.
But because this count is a heartbeat out of date relative to the scheduler's, a partial fix (pending a definitive one) would be for the AMRMClient not to routinely update the RM with this matching total whenever the scheduler's tally is likely to be more accurate.
The RM should be updated when there is a new matching addContainerRequest(), since the scheduler's target could otherwise be too low, or when the AMRMClient's outstanding count is decremented to zero.
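
To make the partial fix concrete, here is a rough sketch of the decision I am proposing, in plain Java. This is NOT the actual AMRMClientImpl code; the class, field, and method names are illustrative only:

{code:java}
// Sketch: keep decrementing the local count on removeContainerRequest(), but
// only push the matching total to the RM when the scheduler's own target
// could otherwise be too low.
class MatchingRequestState {
  int outstanding;    // AMRMClient's local count for this (priority, resource)
  boolean sendToRm;   // whether the next heartbeat should carry this total

  void onAddContainerRequest() {
    outstanding++;
    sendToRm = true;  // new matching add: the scheduler's target could be too low
  }

  void onRemoveContainerRequest() {
    outstanding--;
    if (outstanding == 0) {
      sendToRm = true;  // zeroed out: tell the RM to stop allocating
    }
    // Otherwise skip the update: the scheduler's own tally, decremented as it
    // allocates, is likely more accurate than this heartbeat-old count.
  }
}
{code}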


Please see my response to Wangda Tan on 30 Jan 2015.
Thank you.

> n similar addContainerRequest()s produce n*(n+1)/2 containers
> -------------------------------------------------------------
>
>                 Key: YARN-3020
>                 URL: https://issues.apache.org/jira/browse/YARN-3020
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.5.0, 2.6.0, 2.5.1, 2.5.2
>            Reporter: Peter D Kirchner
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> BUG: If the application master calls addContainerRequest() n times, but with the same priority, I get up to 1+2+3+...+n containers = n*(n+1)/2.  The largest number of containers is requested when the interval between calls to addContainerRequest() exceeds the heartbeat interval of calls to allocate() (in AMRMClientImpl's run() method).
> If the application master calls addContainerRequest() n times, but with a unique priority each time, I get n containers (as I intended).
> Analysis:
> There is a logic problem in AMRMClientImpl.java.
> Although allocate() in AMRMClientImpl.java does an ask.clear(), on subsequent calls to addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest and increments its container count rather than starting anew, then does an addResourceRequestToAsk(), which defeats the ask.clear().
> From documentation and code comments, it was hard for me to discern the intended behavior of the API, but the inconsistency reported in this issue suggests one case or the other is implemented incorrectly.
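
For illustration, a toy model of the accumulation described in the quoted report above (plain Java, not AMRMClientImpl code), assuming one addContainerRequest() per heartbeat and an RM that fully satisfies each ask before the next heartbeat:

{code:java}
public class CumulativeAskModel {
  public static void main(String[] args) {
    int n = 10;
    int storedCount = 0;  // per-priority container count kept by the client; never reset
    int granted = 0;      // containers handed out by the scheduler overall

    for (int heartbeat = 1; heartbeat <= n; heartbeat++) {
      storedCount++;           // addResourceRequest() increments the previous count
      granted += storedCount;  // the re-sent ask becomes a fresh target at the RM
    }
    System.out.println(granted);  // prints 55, i.e. n*(n+1)/2 for n = 10
  }
}
{code}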



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)