You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-user@hadoop.apache.org by Praveen Sripati <pr...@gmail.com> on 2012/01/05 17:29:58 UTC

Queries on next gen MR architecture

Hi,

I had been going through the MRv2 documentation and have the following
queries

1) Let's say that an InputSplit is on Node1 and Node2.

Can the ApplicationMaster ask the ResourceManager for a container either on
Node1 or Node2 with an OR condition?

2) > The Scheduler receives periodic information about the resource usages
on allocated resources from the NodeManagers. The Scheduler also makes
available status of completed Containers to the appropriate
ApplicationMaster.

What's the use of NM sending the resource usages to the scheduler?

Why can't the NM directly talk to the AM about the completed containers?
Does any information pass from NM to AM?

3) >The Map-Reduce ApplicationMaster has the following components:
> TaskUmbilical – The component responsible for receiving heartbeats and
status updates form the map and reduce tasks.

Does the communication happen directly between the container and the AM? If
yes, the task completion status could also be sent from the container to
the AM.

4) > The Hadoop Map-Reduce JobClient polls the ASM to obtain information
about the MR AM and then directly talks to the AM for status, counters etc.

Once the Job is completed the AM goes down, what happens to the Counters?
What is the flow of the Counter (Container -> NM -> AM)?

5) If a new YARN application is created. How can the NM trust the request
from AM?

6) > MapReduce NextGen uses wire-compatible protocols to allow different
versions of servers and clients to communicate with each other.

What is meant by the `wire-compatible protocols` and how is it implemented?

7) > The computation framework (ResourceManager and NodeManager) is
completely generic and is free of MapReduce specificities.

Is this the reason for adding auxiliary services for shuffling to the NM?

Regards,
Praveen

Re: Queries on next gen MR architecture

Posted by Arun C Murthy <ac...@hortonworks.com>.

On Jan 7, 2012, at 6:47 PM, Praveen Sripati wrote:

> Thanks for the response.
> 
> I was just thinking why some of the design decisions were made with MRv2.
> 
> > No, the OR condition is implied by the hierarchy of requests (node, rack, *).
> 
> If InputSplit1 is on Node11 and Node12 and InputSplit2 on Node21 and Node22. Then the AM can ask for 1 containers on each of the nodes and * as 2 for map tasks. Then the RM can return  2 nodes on Node11 and make * as 0. The data locality is lost for InputSplit2 or else the AM has to make another call to RM releasing one of the container and asking for another container.

Remember, you also have racks information to guide the RM...

> A bit more complex request specifying the dependencies might be more effective.

At a very high cost - it's very expensive for the RM to track splits for each task across nodes & racks. To the extent possible, our goal has been to push work to the AM and keep the RM (and NM) really simple to scale & perform well.

> 
> > NM doesn't make any 'out' calls to anyone by RM, else it would be severe scalability bottleneck.
> 
> There is already a one-way communication between the AM and NM for launching the containers. The response can from the NM can hold the list of completed containers from the previous call.
> 

Again, we want too keep the framework (RM/NM) really simple. So, the task can communicate it's status to the AM itself. 

> > All interactions (RPCs) are authenticated. Also, there is a container token provided by the RM (during allocation) which is verified by the NM during container launch.
> 
> So, a shared key has to be deployed manually on all the nodes for the NM?

No, it's automatically shared on startup between the daemons.

Arun

Re: Queries on next gen MR architecture

Posted by Praveen Sripati <pr...@gmail.com>.

Thanks for the response.

I was just thinking why some of the design decisions were made with MRv2.

> No, the OR condition is implied by the hierarchy of requests (node, rack,
*).

If InputSplit1 is on Node11 and Node12 and InputSplit2 on Node21 and
Node22. Then the AM can ask for 1 containers on each of the nodes and * as
2 for map tasks. Then the RM can return  2 nodes on Node11 and make * as 0.
The data locality is lost for InputSplit2 or else the AM has to make
another call to RM releasing one of the container and asking for another
container. A bit more complex request specifying the dependencies might be
more effective.

> NM doesn't make any 'out' calls to anyone by RM, else it would be severe
scalability bottleneck.

There is already a one-way communication between the AM and NM for
launching the containers. The response can from the NM can hold the list of
completed containers from the previous call.

> All interactions (RPCs) are authenticated. Also, there is a container
token provided by the RM (during allocation) which is verified by the NM
during container launch.

So, a shared key has to be deployed manually on all the nodes for the NM?

Regards,
Praveen

On Sun, Jan 8, 2012 at 12:08 AM, Arun C Murthy <ac...@hortonworks.com> wrote:

>
> On Jan 5, 2012, at 8:29 AM, Praveen Sripati wrote:
>
> Hi,
>
> I had been going through the MRv2 documentation and have the following
> queries
>
> 1) Let's say that an InputSplit is on Node1 and Node2.
>
> Can the ApplicationMaster ask the ResourceManager for a container either
> on Node1 or Node2 with an OR condition?
>
>
> No, the OR condition is implied by the hierarchy of requests (node, rack,
> *).
>
> In this case, assuming topology is node1/rack1 node2/rack1, requests would
> be:
> node1 -> 1
> node2 -> 1
> rack1 -> 1
> * -> 1
>
> OTOH, if the topology is node1/rack1, node2/rack2, requests would be:
> node1 -> 1
> node2 -> 1
> rack1 -> 1
> rack2 -> 1
> * -> 1
>
> In both cases, * would limit the #allocated-containers to '1'.
>
> In the first case rack1 itself (independent of *) would limit
> #allocated-containers to 1.
>
> More details here:
> http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/
> .
>
> I'll work on incorporating this into our docs on hadoop.apache.org.
>
> 2) > The Scheduler receives periodic information about the resource usages
> on allocated resources from the NodeManagers. The Scheduler also makes
> available status of completed Containers to the appropriate
> ApplicationMaster.
>
> What's the use of NM sending the resource usages to the scheduler?
>
> Why can't the NM directly talk to the AM about the completed containers?
> Does any information pass from NM to AM?
>
>
> The NM sends resource usages to the scheduler to allow it to track
> resource utilization on each node and, in future, make smarter decisions
> about allocating extra containers on under-utilized nodes etc.
>
> NM doesn't make any 'out' calls to anyone by RM, else it would be severe
> scalability bottleneck.
>
> 3) >The Map-Reduce ApplicationMaster has the following components:
> > TaskUmbilical – The component responsible for receiving heartbeats and
> status updates form the map and reduce tasks.
>
> Does the communication happen directly between the container and the AM?If yes, the task completion status could also be sent from the container to
> the AM.
>
>
> Yes, it already is sent to AM.
>
> 4) > The Hadoop Map-Reduce JobClient polls the ASM to obtain information
> about the MR AM and then directly talks to the AM for status, counters etc.
>
> Once the Job is completed the AM goes down, what happens to the Counters?
> What is the flow of the Counter (Container -> NM -> AM)?
>
>
> Once jobs completes the Counters etc. are stored in JobHistory file (one
> per job) which is served up, if necessary, by the JobHistoryServer.
>
> 5) If a new YARN application is created. How can the NM trust the request
> from AM?
>
>
> All interactions (RPCs) are authenticated. Also, there is a container
> token provided by the RM (during allocation) which is verified by the NM
> during container launch.
>
> 6) > MapReduce NextGen uses wire-compatible protocols to allow different
> versions of servers and clients to communicate with each other.
>
> What is meant by the `wire-compatible protocols` and how is it
> implemented?
>
>
> We use PB everywhere.
>
> 7) > The computation framework (ResourceManager and NodeManager) is
> completely generic and is free of MapReduce specificities.
>
> Is this the reason for adding auxiliary services for shuffling to the NM?
>
>
> Yes.
>
> hth,
> Arun
>

Re: Queries on next gen MR architecture

Posted by Arun C Murthy <ac...@hortonworks.com>.

On Jan 5, 2012, at 8:29 AM, Praveen Sripati wrote:

> Hi,
> 
> I had been going through the MRv2 documentation and have the following queries
> 
> 1) Let's say that an InputSplit is on Node1 and Node2.
> 
> Can the ApplicationMaster ask the ResourceManager for a container either on Node1 or Node2 with an OR condition?
> 

No, the OR condition is implied by the hierarchy of requests (node, rack, *).

In this case, assuming topology is node1/rack1 node2/rack1, requests would be:
node1 -> 1
node2 -> 1
rack1 -> 1
* -> 1

OTOH, if the topology is node1/rack1, node2/rack2, requests would be:
node1 -> 1
node2 -> 1
rack1 -> 1
rack2 -> 1
* -> 1

In both cases, * would limit the #allocated-containers to '1'.

In the first case rack1 itself (independent of *) would limit #allocated-containers to 1.

More details here: http://developer.yahoo.com/blogs/hadoop/posts/2011/03/mapreduce-nextgen-scheduler/. 

I'll work on incorporating this into our docs on hadoop.apache.org.

> 2) > The Scheduler receives periodic information about the resource usages on allocated resources from the NodeManagers. The Scheduler also makes available status of completed Containers to the appropriate ApplicationMaster.
> 
> What's the use of NM sending the resource usages to the scheduler?
> 
> Why can't the NM directly talk to the AM about the completed containers? Does any information pass from NM to AM?
> 

The NM sends resource usages to the scheduler to allow it to track resource utilization on each node and, in future, make smarter decisions about allocating extra containers on under-utilized nodes etc.

NM doesn't make any 'out' calls to anyone by RM, else it would be severe scalability bottleneck.

> 3) >The Map-Reduce ApplicationMaster has the following components:
> > TaskUmbilical – The component responsible for receiving heartbeats and status updates form the map and reduce tasks.
> 
> Does the communication happen directly between the container and the AM? If yes, the task completion status could also be sent from the container to the AM.
> 

Yes, it already is sent to AM.

> 4) > The Hadoop Map-Reduce JobClient polls the ASM to obtain information about the MR AM and then directly talks to the AM for status, counters etc.
> 
> Once the Job is completed the AM goes down, what happens to the Counters? What is the flow of the Counter (Container -> NM -> AM)?
> 

Once jobs completes the Counters etc. are stored in JobHistory file (one per job) which is served up, if necessary, by the JobHistoryServer.

> 5) If a new YARN application is created. How can the NM trust the request from AM?
> 

All interactions (RPCs) are authenticated. Also, there is a container token provided by the RM (during allocation) which is verified by the NM during container launch.

> 6) > MapReduce NextGen uses wire-compatible protocols to allow different versions of servers and clients to communicate with each other.
> 
> What is meant by the `wire-compatible protocols` and how is it implemented?
> 

We use PB everywhere.

> 7) > The computation framework (ResourceManager and NodeManager) is completely generic and is free of MapReduce specificities.
> 
> Is this the reason for adding auxiliary services for shuffling to the NM?
> 

Yes.

hth,
Arun

Re: Queries on next gen MR architecture

Posted by Praveen Sripati <pr...@gmail.com>.

Could someone please clarify on the below queries?

Regards,
Praveen

On Thu, Jan 5, 2012 at 9:59 PM, Praveen Sripati <pr...@gmail.com>wrote:

> Hi,
>
> I had been going through the MRv2 documentation and have the following
> queries
>
> 1) Let's say that an InputSplit is on Node1 and Node2.
>
> Can the ApplicationMaster ask the ResourceManager for a container either
> on Node1 or Node2 with an OR condition?
>
> 2) > The Scheduler receives periodic information about the resource usages
> on allocated resources from the NodeManagers. The Scheduler also makes
> available status of completed Containers to the appropriate
> ApplicationMaster.
>
> What's the use of NM sending the resource usages to the scheduler?
>
> Why can't the NM directly talk to the AM about the completed containers?
> Does any information pass from NM to AM?
>
> 3) >The Map-Reduce ApplicationMaster has the following components:
> > TaskUmbilical – The component responsible for receiving heartbeats and
> status updates form the map and reduce tasks.
>
> Does the communication happen directly between the container and the AM?If yes, the task completion status could also be sent from the container to
> the AM.
>
> 4) > The Hadoop Map-Reduce JobClient polls the ASM to obtain information
> about the MR AM and then directly talks to the AM for status, counters etc.
>
> Once the Job is completed the AM goes down, what happens to the Counters?
> What is the flow of the Counter (Container -> NM -> AM)?
>
> 5) If a new YARN application is created. How can the NM trust the request
> from AM?
>
> 6) > MapReduce NextGen uses wire-compatible protocols to allow different
> versions of servers and clients to communicate with each other.
>
> What is meant by the `wire-compatible protocols` and how is it
> implemented?
>
> 7) > The computation framework (ResourceManager and NodeManager) is
> completely generic and is free of MapReduce specificities.
>
> Is this the reason for adding auxiliary services for shuffling to the NM?
>
> Regards,
> Praveen
>