Posted to mapreduce-dev@hadoop.apache.org by Tsuyoshi OZAWA <oz...@gmail.com> on 2012/07/31 03:11:20 UTC

Multi-level aggregation with combining the result of maps per node/rack

Hi,

We consider the shuffle cost to be a main concern in MapReduce,
in particular for aggregation processing.
Shuffle is still expensive in Hadoop in spite of the combiner,
because the scope of combining is limited to a single MapTask.

To solve this problem, I've implemented a prototype that
combines the results of multiple maps per node [1].
This is a first step toward making Hadoop faster with a
multi-level aggregation technique like Google Dremel [2].

I took a benchmark with the prototype.
The benchmark is the WordCount program with the in-mapper combining
optimization, run on 40 nodes [3]. The input data sets are 300GB,
500GB, 1TB, and 2TB of text generated by the default RandomTextWriter.
The number of reducers is set to 1, on the assumption that some
workloads force a single reducer, as in Google Dremel. The results
are as follows:

                         | 300GB | 500GB |   1TB |   2TB |
            Normal (sec) |  4004 |  5551 | 12177 | 27608 |
Combining per node (sec) |  3678 |  3844 |  7440 | 15591 |

Note that in the current prototype the per-node combiner runs only
every 3 minutes, so the aggregation rate is very limited.

"Normal" is the result with current Hadoop, and "Combining per node"
is the result with my optimization.  Despite the 3-minute
restriction, the prototype is 1.7 times faster than normal Hadoop
in the 2TB case.  Another benchmark also shows that the shuffle cost
is cut down by 50%.
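
For reference, the benchmark mapper follows the usual in-mapper
combining pattern, roughly like the sketch below. This is a simplified
sketch, not the exact benchmark code, and the class name is made up:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified sketch of the in-mapper combining pattern used by the
// benchmark (not the exact benchmark code).
public class InMapperCombiningWordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Per-task word counts, aggregated in memory instead of emitting
  // a (word, 1) pair for every token.
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        Integer current = counts.get(word);
        counts.put(word, current == null ? 1 : current + 1);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit the aggregated counts once, when the task finishes.
    Text outKey = new Text();
    IntWritable outValue = new IntWritable();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      outKey.set(e.getKey());
      outValue.set(e.getValue());
      context.write(outKey, outValue);
    }
  }
}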

I'd like to hear from you: do you think this is a useful feature?
If yes, I will work on contributing it.
You are also welcome to tell me which benchmarks you would like
me to run with my prototype.

Regards,
Tsuyoshi


[1] The idea is also described on the Hadoop wiki:
    http://wiki.apache.org/hadoop/HadoopResearchProjects
[2] The Dremel paper is available at:
    http://research.google.com/pubs/pub36632.html
[3] The specification of each node is as follows:
    CPU Core(TM)2 Duo CPU E7400 2.80GHz x 2
    Memory 8 GB
    Network 1 GbE

Re: Multi-level aggregation with combining the result of maps per node/rack

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.
Bikas,

Yes, this feature is similar to the Dryad approach.
The design of the current prototype is a bit hacky, to minimize
the implementation cost.
The only modified file is MapTask.java. The processing flow is as follows:

1. After the map function finishes, move the map output files to a
   node-local temporary directory.
2. A leader among the containers on the node runs the combiner every
   3 minutes if there are files under the temporary directory.

The leader is elected via a file lock. I know that this design breaks
the fault tolerance of MapReduce and isn't acceptable for Hadoop as is,
so I'll renew the design and document it.
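
Concretely, the leader's loop looks roughly like the following sketch.
This is not the actual MapTask.java patch; the temporary directory,
file suffix, and class name here are hypothetical:

import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Rough sketch of the per-node combining loop described above
// (hypothetical names; not the actual patch).
public class NodeLocalCombineLoop {

  private static final long COMBINE_INTERVAL_MS = 3 * 60 * 1000; // every 3 minutes

  public static void main(String[] args) throws Exception {
    File tmpDir = new File("/tmp/hadoop-node-combine"); // hypothetical step-1 target directory
    File lockFile = new File(tmpDir, "leader.lock");    // lock file used for leader election
    tmpDir.mkdirs();

    while (true) {
      RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
      FileChannel channel = raf.getChannel();
      try {
        // Step 2: the task that wins the file lock acts as the leader
        // for this node and runs the combiner.
        FileLock lock = channel.tryLock();
        if (lock != null) {
          try {
            combinePendingMapOutputs(tmpDir);
          } finally {
            lock.release();
          }
        }
      } finally {
        channel.close();
        raf.close();
      }
      Thread.sleep(COMBINE_INTERVAL_MS);
    }
  }

  // Placeholder: run the job's combiner over the map output files that
  // step 1 moved into the node-local temporary directory.
  private static void combinePendingMapOutputs(File tmpDir) throws IOException {
    File[] outputs = tmpDir.listFiles(new FilenameFilter() {
      public boolean accept(File dir, String name) {
        return name.endsWith(".out");
      }
    });
    if (outputs == null || outputs.length == 0) {
      return;
    }
    // ... merge the files and feed them through the job's combiner ...
  }
}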

> That is useful to reduce the input arity (as you suggest) and also helps with
> reducing the chance of seeing failures.

What does the latter mean? Could you explain with an example?

Thank you for your comment,
Tsuyoshi OZAWA




-- 
OZAWA Tsuyoshi

RE: Multi-level aggregation with combining the result of maps per node/rack

Posted by Bikas Saha <bi...@hortonworks.com>.
Can you please share a brief note on the design? Just a few sentences on
the main changes.



What you are saying sounds similar to multi-level aggregation done in the
Dryad <http://www.cs.cmu.edu/~./15712/papers/isard07.pdf> runtime. That is
useful to reduce the input arity (as you suggest) and also helps with
reducing the chance of seeing failures.



Bikas




Re: Multi-level aggregation with combining the result of maps per node/rack

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.
Robert,

Thank you for your valuable comments and for sharing the related JIRA tickets.

Combining container reuse (MAPREDUCE-3902) with a coordination system
in the AM is a good idea to minimize the implementation cost and ensure
fault tolerance. That design can also solve the scheduling problem of
when to run the combiner. The current design doesn't address security
of the intermediate data at all, so I'll consider that as well.

I'll create a new design note with your comments in mind and attach it
to a new JIRA (MAPREDUCE-4502).

Thanks,
Tsuyoshi OZAWA




-- 
OZAWA Tsuyoshi

Re: Multi-level aggregation with combining the result of maps per node/rack

Posted by Robert Evans <ev...@yahoo-inc.com>.
Tsuyoshi,


There has been a lot of work happening in the shuffle phase.  It is being
made pluggable in both 1.0 and 2.0/trunk (MAPREDUCE-4049).  There is also
some work being done to reuse containers in trunk/2.0 (MAPREDUCE-3902).
This should have a similar, although perhaps more limited, result, because
when different map tasks run in the same container their outputs also go
through the same combiner.  I have heard that it is showing some good
results for both small and large jobs.  There was also some work to try
and pull in Sailfish (no JIRA, just ramblings on the mailing list), which
moves the shuffle phase to a separate process.  I have not seen much
happen on that front recently; it saw some large gains on big jobs but
is worse on small jobs.  I think that this is something very interesting
and I would encourage you to file a JIRA and pursue it.

I don't know anything about your design, so please feel free to disregard
my comments if they do not apply.  I would encourage you to think about
security on this.  When you run the combiner you need to be sure that it
runs as the user that owns the data.  This should probably not be too
difficult if you hijack a mapper task that has just finished to try and
combine the data from others on the same node.  To do this you will
probably need some sort of coordination system in the AM to tell that
mapper which other mappers to try and combine data from.  It would be nice
to coordinate this with the container reuse work, which currently just
tells the container to run another split through.  It could be another
option to tell it to combine with the map output from container X.
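
To make that concrete, the directive the AM sends could look something
like the sketch below. This is purely hypothetical, none of these types
exist in Hadoop today; it just illustrates extending the container reuse
protocol with a "combine" action:

import java.util.List;

// Purely hypothetical sketch of an AM-to-container instruction; these
// types do not exist in Hadoop, they only illustrate the idea.
public class ContainerDirective {

  // What the AM asks a reused or just-finished map container to do next.
  public enum Action {
    RUN_SPLIT,          // container reuse as in MAPREDUCE-3902: run another input split
    COMBINE_MAP_OUTPUTS // new: combine the named map outputs on this node
  }

  private final Action action;
  private final String owningUser;           // the combiner must run as the user owning the data
  private final List<String> mapOutputPaths; // node-local outputs of other finished mappers

  public ContainerDirective(Action action, String owningUser,
                            List<String> mapOutputPaths) {
    this.action = action;
    this.owningUser = owningUser;
    this.mapOutputPaths = mapOutputPaths;
  }

  public Action getAction() { return action; }
  public String getOwningUser() { return owningUser; }
  public List<String> getMapOutputPaths() { return mapOutputPaths; }
}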

Another thing to be aware of is small jobs.  It would be great to see how
this impacts small jobs, and if it has a negative impact we should look
for an automated way to turn this off or on.

Thanks for your work,

Bobby Evans
