Posted to yarn-dev@hadoop.apache.org by "Srikanth Kandula (JIRA)" <ji...@apache.org> on 2015/08/27 06:09:45 UTC

[jira] [Created] (YARN-4088) RM should be able to process heartbeats from NM asynchronously

Srikanth Kandula created YARN-4088:
--------------------------------------

             Summary: RM should be able to process heartbeats from NM asynchronously
                 Key: YARN-4088
                 URL: https://issues.apache.org/jira/browse/YARN-4088
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager, scheduler
            Reporter: Srikanth Kandula


Today, the RM sequentially processes one heartbeat after another. 

Imagine a 3000-server cluster with each server heartbeating every 3s. Processed one at a time, 3000 heartbeats per 3s interval leave the RM 1ms on average to process each NM heartbeat. That is tough.
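
To make the arithmetic above concrete, here is a small back-of-the-envelope calculation. The class and method names are made up for illustration, and the worker count of 8 is an arbitrary assumption; the concurrent variant is idealized and ignores any contention on shared state.

```java
// Back-of-the-envelope budget for heartbeat processing.
// Hypothetical helper for illustration only; not part of YARN.
final class HeartbeatBudget {

    /** Average time (ms) available per heartbeat when heartbeats are handled one at a time. */
    static double perHeartbeatMillis(int nodes, long intervalMillis) {
        return (double) intervalMillis / nodes;
    }

    /**
     * Same budget if `workers` heartbeats could be processed concurrently.
     * Idealized: assumes perfect parallelism with no lock contention.
     */
    static double perHeartbeatMillis(int nodes, long intervalMillis, int workers) {
        return workers * perHeartbeatMillis(nodes, intervalMillis);
    }

    public static void main(String[] args) {
        // 3000 nodes, one heartbeat per node every 3000 ms, sequential processing.
        System.out.println(perHeartbeatMillis(3000, 3000));     // prints 1.0
        // Same cluster with an assumed 8 concurrent workers.
        System.out.println(perHeartbeatMillis(3000, 3000, 8));  // prints 8.0
    }
}
```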

It is true that heartbeat processing touches several underlying data structures, so it is non-trivial to parallelize the NM heartbeat. Yet, it is quite doable...
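
One possible shape for this, purely as a sketch: hand each heartbeat to a worker pool, keep per-node state in a concurrent map so that updates for different nodes do not block each other, and keep cluster-wide aggregates in atomics. Everything below (class name, fields, the choice of "free memory" as the tracked state) is a hypothetical stand-in, not the RM's actual data structures or API.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of asynchronous heartbeat handling; none of these names are YARN classes.
final class AsyncHeartbeatSketch {
    // Per-node state (assumed here: last reported free memory in MB), keyed by node id.
    private final ConcurrentHashMap<String, Long> freeMemByNode = new ConcurrentHashMap<>();
    // Cluster-wide aggregate shared by all workers, updated atomically.
    private final AtomicLong clusterFreeMem = new AtomicLong();
    private final ExecutorService pool;

    AsyncHeartbeatSketch(int workers) {
        this.pool = Executors.newFixedThreadPool(workers);
    }

    /** Enqueue a heartbeat and return immediately; a worker thread applies it later. */
    void onHeartbeat(String nodeId, long freeMemMb) {
        pool.execute(() -> apply(nodeId, freeMemMb));
    }

    private void apply(String nodeId, long freeMemMb) {
        // compute() is atomic per key: updates for the same node are serialized,
        // while updates for different nodes can proceed in parallel.
        freeMemByNode.compute(nodeId, (id, old) -> {
            long delta = freeMemMb - (old == null ? 0L : old);
            clusterFreeMem.addAndGet(delta); // keep the aggregate in step with the map
            return freeMemMb;
        });
    }

    /** Stop accepting heartbeats, wait for queued ones to be applied, and return the aggregate. */
    long drainAndGetClusterFreeMem() {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("heartbeats still pending");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return clusterFreeMem.get();
    }

    public static void main(String[] args) {
        AsyncHeartbeatSketch rm = new AsyncHeartbeatSketch(4);
        for (int i = 0; i < 3000; i++) {
            rm.onHeartbeat("node-" + i, 1024L); // one heartbeat per simulated node
        }
        System.out.println(rm.drainAndGetClusterFreeMem());
    }
}
```

Note that this sketch does not preserve the arrival order of two heartbeats from the same node; a real design would likely need per-node queues or sequence numbers, and would have to cover the scheduler state that heartbeats also touch.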

Parallelizing the NM heartbeat would substantially improve the scalability of the RM, allowing it to
a) run larger clusters, or
b) support faster heartbeats or dynamic scaling of heartbeats, or
c) take more asks from each application, or
d) use cleverer/more expensive algorithms such as node labels or better packing or ...

Indeed, the RM's scalability limit has been cited as the motivating reason for a variety of efforts, which would become less necessary if this can be solved. The same goes for slow heartbeats. See the Sparrow and Mercury papers, for example.

Can we take a shot at this?
If not, could we discuss why?
