Posted to mapreduce-dev@hadoop.apache.org by Jie Li <ji...@cs.duke.edu> on 2012/01/16 20:41:31 UTC

How is MRv2 fundamentally changed?

Hi all,

As we know, MRv2 (the MapReduce library in YARN) has changed significantly.
We have a cost model built for MapReduce in Hadoop and are going to
migrate it to MRv2. Can anyone give us a pointer to the fundamental
differences between them? Below is my current understanding; please
feel free to correct me.

1. JT has been replaced by a central RM and a per-application AM.
2. TT has been replaced by the NM, and the task slots have been replaced by
containers. Containers can be allocated dynamically, so both the
number and the memory size of the containers can vary on demand.
3. The shuffle service has become independent from the Map.
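
Point 2 can be illustrated with a rough sketch. It assumes memory is the only
resource that limits how many containers fit on a node; the function names
and all numbers are hypothetical, and real schedulers also round requests up
to a minimum allocation, which is ignored here.

```python
def slots_per_node(map_slots, reduce_slots):
    # MRv1: the TaskTracker is configured with fixed slot counts,
    # so concurrency does not depend on how much memory each task needs.
    return map_slots + reduce_slots


def containers_per_node(node_memory_mb, container_memory_mb):
    # YARN (simplified assumption): concurrency is bounded by the memory
    # the NM offers divided by the memory each container requests.
    return node_memory_mb // container_memory_mb


# Hypothetical node offering 8 GB to containers.
small = containers_per_node(8192, 1024)  # many small containers
large = containers_per_node(8192, 2048)  # fewer, larger containers
fixed = slots_per_node(4, 2)             # fixed regardless of task size
```

Under this assumption, a model that treated the number of concurrent tasks per
node as a constant would instead need to derive it from the requested
container size.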

Thanks,
Jie

Re: How is MRv2 fundamentally changed?

Posted by Jie Li <ji...@cs.duke.edu>.

Hi Vinod,

Thanks a lot for the info. That's good to know!

Yeah, our model is based on 0.20. Could you possibly give us a pointer to
the main changes related to the map/shuffle/reduce phases since 0.20? We'd
be excited to extend our cost-based optimization to YARN.

Jie

On Tue, Jan 24, 2012 at 12:02 AM, Vinod Kumar Vavilapalli <
vinodkv@hortonworks.com> wrote:

> On Mon, Jan 23, 2012 at 5:11 PM, Jie Li <ji...@cs.duke.edu> wrote:
> > What we are looking for, is more of the difference at the task level.
> > Suppose a map task takes 10 minutes in Hadoop, then we have a model to
> > analyse what makes up the 10 minutes, e.g. reading from HDFS, invoking the
> > map function, writing to the buffer, partitioning, sorting and merging.
> > This model can be used to identify the bottleneck of the task execution and
> > suggest better configurations.
>
>
> The task run time hasn't changed from 0.21/0.22. But it has changed if
> you compare with 0.20, the new runtime has a lot of performance
> improvements and is expected to be better with all the optimizations.
> To answer your question, yes your 'model' shouldn't need any changes.
>
>
> > If we run MR jobs in YARN, can we use the same model to analyse the running
> > time of a task? One possible difference I've noticed so far is that the
> > shuffling has become a service of the node manager. Any other change
> > related to the map phase or reduce phase?
>
> Shuffle used to be part of the TaskTracker, it is now in the NM.
> Except that, there isn't much difference that should affect you.
>
> HTH,
> +Vinod
>
>

Re: How is MRv2 fundamentally changed?

Posted by Vinod Kumar Vavilapalli <vi...@hortonworks.com>.

On Mon, Jan 23, 2012 at 5:11 PM, Jie Li <ji...@cs.duke.edu> wrote:
> What we are looking for, is more of the difference at the task level.
> Suppose a map task takes 10 minutes in Hadoop, then we have a model to
> analyse what makes up the 10 minutes, e.g. reading from HDFS, invoking the
> map function, writing to the buffer, partitioning, sorting and merging.
> This model can be used to identify the bottleneck of the task execution and
> suggest better configurations.

The task runtime hasn't changed since 0.21/0.22. It has changed compared
with 0.20, though: the new runtime has a lot of performance
improvements and is expected to be better with all the optimizations.
To answer your question: yes, your 'model' shouldn't need any changes.

> If we run MR jobs in YARN, can we use the same model to analyse the running
> time of a task? One possible difference I've noticed so far is that the
> shuffling has become a service of the node manager. Any other change
> related to the map phase or reduce phase?

Shuffle used to be part of the TaskTracker; it is now in the NM.
Other than that, there isn't much difference that should affect you.

HTH,
+Vinod

Re: How is MRv2 fundamentally changed?

Posted by Jie Li <ji...@cs.duke.edu>.

Hi Mahadev,

Thanks, they are both very helpful for understanding the architecture of YARN.

What we are looking for is more the differences at the task level.
Suppose a map task takes 10 minutes in Hadoop, then we have a model to
analyse what makes up the 10 minutes, e.g. reading from HDFS, invoking the
map function, writing to the buffer, partitioning, sorting and merging.
This model can be used to identify the bottleneck of the task execution and
suggest better configurations.
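
A minimal sketch of such a per-phase breakdown, assuming the phases listed
above. The phase keys, the function name, and the timings are hypothetical
and chosen only for illustration; they are not measurements.

```python
# Phases of a map task as listed above (key names are hypothetical).
PHASES = ["read_hdfs", "map_function", "buffer_write",
          "partition", "sort", "merge"]


def breakdown(timings):
    """Return total time, each phase's share of it, and the slowest phase."""
    total = sum(timings[p] for p in PHASES)
    shares = {p: timings[p] / total for p in PHASES}
    bottleneck = max(PHASES, key=lambda p: timings[p])
    return total, shares, bottleneck


# Illustrative timings in seconds for a 10-minute map task.
example = {"read_hdfs": 120, "map_function": 240, "buffer_write": 60,
           "partition": 30, "sort": 90, "merge": 60}
total, shares, bottleneck = breakdown(example)
```

With these made-up numbers the map function dominates, so the model would
point at the user code rather than at sort or merge settings.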

If we run MR jobs in YARN, can we use the same model to analyse the running
time of a task? One possible difference I've noticed so far is that the
shuffling has become a service of the node manager. Any other change
related to the map phase or reduce phase?

Thanks,
Jie

On Mon, Jan 16, 2012 at 4:32 PM, Mahadev Konar <ma...@hortonworks.com> wrote:

> Hi Jie,
>  You might want to read through:
>
> http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html
> and
> http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/
>
> for more information on the architecture. It'll help you understand the
> major differences between the two.
>
> mahadev
>
> On Mon, Jan 16, 2012 at 11:41 AM, Jie Li <ji...@cs.duke.edu> wrote:
> > Hi all,
> >
> > As we know MRv2 (the MapReduce library in YARN) has changed significantly.
> > We have a cost model built for the MapReduce in Hadoop and are going to
> > migrate to MRv2. Can anyone give us a pointer to the fundamental
> > differences between them? Also, below are some of my understandings and
> > feel free to correct me.
> >
> > 1. JT has been replaced by a central RM and a per-application AM.
> > 2. TT has been replaced by the NM and the task slots have been replaced by
> > the containers. The containers can be allocated dynamically thus both the
> > number and the memory size of the containers can vary on demand.
> > 3. The shuffle service has become independent from the Map.
> >
> > Thanks,
> > Jie
>
>

Re: How is MRv2 fundamentally changed?

Posted by Mahadev Konar <ma...@hortonworks.com>.

Hi Jie,
 You might want to read through:
http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html
and http://developer.yahoo.com/blogs/hadoop/posts/2011/02/mapreduce-nextgen/

for more information on the architecture. It'll help you understand the
major differences between the two.

mahadev

On Mon, Jan 16, 2012 at 11:41 AM, Jie Li <ji...@cs.duke.edu> wrote:
> Hi all,
>
> As we know MRv2 (the MapReduce library in YARN) has changed significantly.
> We have a cost model built for the MapReduce in Hadoop and are going to
> migrate to MRv2. Can anyone give us a pointer to the fundamental
> differences between them? Also, below are some of my understandings and
> feel free to correct me.
>
> 1. JT has been replaced by a central RM and a per-application AM.
> 2. TT has been replaced by the NM and the task slots have been replaced by
> the containers. The containers can be allocated dynamically thus both the
> number and the memory size of the containers can vary on demand.
> 3. The shuffle service has become independent from the Map.
>
> Thanks,
> Jie