You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Srikanth Kandula (JIRA)" <ji...@apache.org> on 2015/09/02 20:17:46 UTC

[jira] [Commented] (YARN-4088) RM should be able to process heartbeats from NM asynchronously

    [ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727761#comment-14727761 ] 

Srikanth Kandula commented on YARN-4088:
----------------------------------------

True. a) Not sure if this (out-of-band heartbeat upon container completion) happens today. b) Processing one NM at a time is unlikely to cope well with the storms of heartbeats.

> RM should be able to process heartbeats from NM asynchronously
> --------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying datastructures that will be touched during heartbeat processing. So, it is non-trivial to parallelize the NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of the RM, allowing it to either 
> a) run larger clusters or 
> b) support faster heartbeats or dynamic scaling of heartbeats
> c) take more asks from each application or 
> c) use cleverer/ more expensive algorithms such as node labels or better packing or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for a variety of efforts which will become less needed if this can be solved. Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)