You are viewing a plain text version of this content. The canonical link for it is here.

Posted to yarn-issues@hadoop.apache.org by "Srikanth Kandula (JIRA)" <ji...@apache.org> on 2015/09/09 21:22:46 UTC

[jira] [Updated] (YARN-4088) RM should be able to process heartbeats from NM concurrently

     [ https://issues.apache.org/jira/browse/YARN-4088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Srikanth Kandula updated YARN-4088:
-----------------------------------
    Summary: RM should be able to process heartbeats from NM concurrently  (was: RM should be able to process heartbeats from NM asynchronously)

> RM should be able to process heartbeats from NM concurrently
> ------------------------------------------------------------
>
>                 Key: YARN-4088
>                 URL: https://issues.apache.org/jira/browse/YARN-4088
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, scheduler
>            Reporter: Srikanth Kandula
>
> Today, the RM sequentially processes one heartbeat after another. 
> Imagine a 3000 server cluster with each server heart-beating every 3s. This gives the RM 1ms on average to process each NM heartbeat. That is tough.
> It is true that there are several underlying datastructures that will be touched during heartbeat processing. So, it is non-trivial to parallelize the NM heartbeat. Yet, it is quite doable...
> Parallelizing the NM heartbeat would substantially improve the scalability of the RM, allowing it to either 
> a) run larger clusters or 
> b) support faster heartbeats or dynamic scaling of heartbeats
> c) take more asks from each application or 
> c) use cleverer/ more expensive algorithms such as node labels or better packing or ...
> Indeed the RM's scalability limit has been cited as the motivating reason for a variety of efforts which will become less needed if this can be solved. Ditto for slow heartbeats.  See Sparrow and Mercury papers for example.
> Can we take a shot at this?
> If not, could we discuss why.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)