Posted to common-dev@hadoop.apache.org by Mike Schliep <sc...@cs.umn.edu> on 2013/10/22 16:50:01 UTC

Combiner Execution

For a class project, my group and I are looking to experiment with
combining the output from Mappers on the same node or in the same rack.
We found the idea at http://wiki.apache.org/hadoop/HadoopResearchProjects.

According to http://developer.yahoo.com/hadoop/tutorial/module4.html, the
output is already combined over all Mappers in a node, but we cannot
find where this happens. Can someone point us to where this combiner
is executed?

Thanks,
Mike Schliep

Re: Combiner Execution

Posted by Tsuyoshi OZAWA <oz...@gmail.com>.

Hi,

I'm working on node-level aggregation for MapReduce. Please see the
following JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-4502
I'm waiting for review by the community.

It can also be implemented in Tez, as Bikas and Gopal mentioned.

Thanks,

On Wed, Oct 23, 2013 at 1:28 AM, Bikas Saha <bi...@hortonworks.com> wrote:
> +1. A node level or rack level or any level intermediate combiner is
> fairly straightforward to add in Tez. Please carry over your question to
> the Apache Tez dev mailing list dev@tez.incubator.apache.org if you are
> interested in following that path.
>
> Bikas



-- 
- Tsuyoshi

RE: Combiner Execution

Posted by Bikas Saha <bi...@hortonworks.com>.

+1. A node-level, rack-level, or any-level intermediate combiner is
fairly straightforward to add in Tez. Please carry over your question to
the Apache Tez dev mailing list dev@tez.incubator.apache.org if you are
interested in following that path.

Bikas

-----Original Message-----
From: gopal@hortonworks.com [mailto:gopal@hortonworks.com] On Behalf Of
Gopal Vijayaraghavan
Sent: Tuesday, October 22, 2013 9:03 AM
To: common-dev@hadoop.apache.org
Subject: Re: Combiner Execution


Re: Combiner Execution

Posted by Gopal Vijayaraghavan <go...@apache.org>.

Hi,

I'll answer your questions in reverse.

> According to http://developer.yahoo.com/hadoop/tutorial/module4.html the output is already combined over all Mappers in a node. But we can not find how this is happening. Can someone point us to where this combiner is executed?

You'll find the Combiner runner buried somewhere inside MapTask.java;
search for combinerRunner in there.

The Combiner only combines the output of a single map task (after
sorting). During the final merge of that task's spill files it kicks in
only if the task produced at least minSpillsForCombine spills.
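
To make the threshold concrete, here is a minimal sketch of the merge-time
decision for a single map task. It is a toy model in plain Java, not the code
in MapTask.java: the class SpillMergeSketch, the method mergeSpills, and the
word-count style records are assumptions made for illustration, and only the
name minSpillsForCombine comes from the message. The sketch assumes the
combiner is applied once the spill count reaches the threshold; whether the
real comparison is strict is worth checking against MapTask.java.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one map task's final merge; hypothetical, not Hadoop's code. */
class SpillMergeSketch {

    /**
     * Merges the spills of a single map task. Each spill is a list of keys,
     * and every key carries an implicit count of 1 (word-count style).
     * The combiner (a per-key sum) runs only when the task produced at
     * least minSpillsForCombine spills (assumed comparison, see lead-in).
     */
    static List<String> mergeSpills(List<List<String>> spills, int minSpillsForCombine) {
        // Concatenate and sort, standing in for the merge of sorted spill files.
        List<String> merged = new ArrayList<>();
        for (List<String> spill : spills) {
            merged.addAll(spill);
        }
        Collections.sort(merged);

        List<String> out = new ArrayList<>();
        if (spills.size() < minSpillsForCombine) {
            // Too few spills: records pass through uncombined.
            for (String key : merged) {
                out.add(key + "=1");
            }
            return out;
        }

        // Enough spills: sum the counts of equal keys within this one task.
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : merged) {
            counts.merge(key, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> twoSpills = List.of(List.of("a", "b"), List.of("a"));
        List<List<String>> threeSpills = List.of(List.of("a", "b"), List.of("a"), List.of("b"));
        // The threshold 3 is an illustrative value chosen for this sketch.
        System.out.println(mergeSpills(twoSpills, 3));   // below the threshold
        System.out.println(mergeSpills(threeSpills, 3)); // threshold reached
    }
}
```

With two spills the records come out uncombined (a=1, a=1, b=1); with three
spills equal keys are summed (a=2, b=2). Nothing here looks at any other map
task, which is the point of the paragraph above.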

It does not do any cross-task actions, and the MR framework (as it is
today) doesn't leave enough room for scheduling a cross-task activity
(i.e., MR is strictly bipartite: map tasks feed reduce tasks, with no
stage in between).

> For a class project my group and I are looking to experiment with combining the output from Mappers on the same node or in the same rack. We found the idea at http://wiki.apache.org/hadoop/HadoopResearchProjects.

Your general idea is roughly outlined in Apache Tez, which is designed
to be more flexible with its plumbing (per-host/per-rack multi-level
combiner trees):
https://issues.apache.org/jira/browse/TEZ-145
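
As a rough illustration of what a per-host/per-rack combiner tree would buy,
the sketch below combines word-count style partial results first per host and
then per rack, and counts how many records would be shipped at each level. It
is plain Java with hypothetical names (CombinerTreeSketch, combineByGroup,
recordCount) and made-up host and rack labels; it does not use or reflect the
Tez or MapReduce APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a per-task -> per-host -> per-rack combiner tree (hypothetical). */
class CombinerTreeSketch {

    /** Merges several partial counts into one, summing the counts of equal keys. */
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                out.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return out;
    }

    /**
     * Groups outputs by a location label (host or rack) and combines within
     * each group. groupOf.get(i) is the label of outputs.get(i). Results are
     * returned in sorted label order.
     */
    static List<Map<String, Integer>> combineByGroup(List<Map<String, Integer>> outputs,
                                                     List<String> groupOf) {
        Map<String, List<Map<String, Integer>>> groups = new TreeMap<>();
        for (int i = 0; i < outputs.size(); i++) {
            groups.computeIfAbsent(groupOf.get(i), g -> new ArrayList<>()).add(outputs.get(i));
        }
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (List<Map<String, Integer>> group : groups.values()) {
            combined.add(combine(group));
        }
        return combined;
    }

    /** Total number of key/value records that would be sent downstream. */
    static int recordCount(List<Map<String, Integer>> outputs) {
        int n = 0;
        for (Map<String, Integer> m : outputs) {
            n += m.size();
        }
        return n;
    }

    public static void main(String[] args) {
        // Already-combined output of four map tasks, two per host.
        List<Map<String, Integer>> perTask = List.of(
                Map.of("a", 1, "b", 1), Map.of("a", 2), Map.of("a", 1), Map.of("b", 3));
        List<Map<String, Integer>> perHost =
                combineByGroup(perTask, List.of("host1", "host1", "host2", "host2"));
        List<Map<String, Integer>> perRack =
                combineByGroup(perHost, List.of("rack1", "rack1"));
        System.out.println(recordCount(perTask) + " -> " + recordCount(perHost)
                + " -> " + recordCount(perRack));
    }
}
```

In this small example the record count drops from 5 (per task) to 4 (per host)
to 2 (per rack). Real savings depend on how often keys repeat across tasks on
the same host or rack.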

Cheers,
Gopal

--
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.

Re: Combiner Execution

Posted by Amr Shahin <am...@gmail.com>.

What sort of combining are you trying to achieve?
Hadoop combining means that Hadoop will collect the output from all the
maps and guarantee that all the outputs that have the same key will be sent
to the same reducer (you can find more details in "Hadoop: The Definitive
Guide", chapters 2 and 6).
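
The guarantee described here (all outputs with the same key reach the same
reducer) is usually attributed to partitioning during the shuffle rather than
to the combiner, which only pre-aggregates map output. A minimal sketch of
hash partitioning follows. The class and method names are hypothetical, and
the formula is modeled on what Hadoop's default HashPartitioner is understood
to do, so treat it as an illustration rather than the framework's code.

```java
/** Toy hash partitioner; hypothetical names, plain Java, no Hadoop dependency. */
class PartitionSketch {

    /**
     * Maps a key to one of numReduceTasks partitions. Equal keys always get
     * the same partition, regardless of which map task emitted them.
     */
    static int partitionFor(String key, int numReduceTasks) {
        // Clear the sign bit so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keysFromTaskA = {"apple", "pear"};
        String[] keysFromTaskB = {"pear", "apple", "fig"};
        int reducers = 4; // illustrative value
        for (String key : keysFromTaskA) {
            System.out.println("task A: " + key + " -> reducer " + partitionFor(key, reducers));
        }
        for (String key : keysFromTaskB) {
            System.out.println("task B: " + key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Running it shows "apple" and "pear" landing on the same reducer from both
tasks, with no combining involved.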

On Tue, Oct 22, 2013 at 5:50 PM, Mike Schliep <sc...@cs.umn.edu> wrote:

> For a class project my group and I are looking to experiment with
> combining the output from Mappers on the same node or in the same rack. We
> found the idea at http://wiki.apache.org/hadoop/HadoopResearchProjects.
>
> According to http://developer.yahoo.com/hadoop/tutorial/module4.html the
> output is already combined over all Mappers in a node. But we can not
> find how this is happening. Can someone point us to where this combiner is
> executed?
>
> Thanks,
> Mike Schliep
>