Posted to common-dev@hadoop.apache.org by "Srikanth Kakani (JIRA)" <ji...@apache.org> on 2007/10/29 22:18:51 UTC

[jira] Commented: (HADOOP-363) When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.

    [ https://issues.apache.org/jira/browse/HADOOP-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12538599 ] 

Srikanth Kakani commented on HADOOP-363:
----------------------------------------

We recently ran into a problem which makes this a very important feature to have.

Background:
Consider an application that generates key/value pairs which are deduplicated ("uniqued") by the combiner. The application's data needs to be partitioned on the key, and the key distribution is imbalanced: some keys have a very large number of values to deduplicate. The input data is huge, so the job needs a huge number of mappers, and we end up with a huge number of map outputs, a high percentage of which go to a single reducer.
Multi-level merges occur at various stages: on the map outputs, before partitioning, and of course after the data reaches the reducers.

If we apply the combiner at each of these merges, we can dramatically reduce the amount of data written after each merge, and therefore the amount of data that reaches each reducer.
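
To make that concrete, here is a minimal sketch in plain Java (against the old org.apache.hadoop.mapred API) of what "combining during a merge" could look like: each run of equal keys in the already-sorted merged stream is handed to the combiner before being written back out. The CombineDuringMerge class, the Pair holder, and the combine() method are illustrative names only, not existing Hadoop code.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CombineDuringMerge<K extends Comparable<? super K>, V> {

  /** Simple key/value holder for the merged stream (illustrative only). */
  public static class Pair<K, V> {
    public final K key;
    public final V value;
    public Pair(K key, V value) { this.key = key; this.value = value; }
  }

  /**
   * Walk an already-sorted (merged) stream and hand each run of equal keys
   * to the combiner, so every merge pass writes combined records instead of
   * the raw run.
   */
  public void combine(Iterator<Pair<K, V>> sorted,
                      Reducer<K, V, K, V> combiner,
                      OutputCollector<K, V> out) throws IOException {
    Pair<K, V> pending = sorted.hasNext() ? sorted.next() : null;
    while (pending != null) {
      K runKey = pending.key;
      List<V> runValues = new ArrayList<V>();
      runValues.add(pending.value);
      pending = null;
      // Gather the rest of the run: consecutive pairs with an equal key.
      while (sorted.hasNext()) {
        Pair<K, V> next = sorted.next();
        if (next.key.compareTo(runKey) == 0) {
          runValues.add(next.value);
        } else {
          pending = next;   // first pair of the following run
          break;
        }
      }
      // One combiner call per distinct key in this merge pass.
      combiner.reduce(runKey, runValues.iterator(), out, Reporter.NULL);
    }
  }
}

With something like this in the merge path, the combiner would run not only before the first spill but at every multi-level merge, which is exactly where the deduplication pays off in our case.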

In our specific case, applying the combiner at all merges can potentially reduce the data to 75% of its original size. Additionally, we "expect" the merges themselves to save even more time, since they write out less data on each pass.

Notes:
The framework should provide a "unique" combiner:
If a key/value pair byte-compares equal to an already-seen key/value pair, drop the new one.
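
As a concrete illustration of that "unique" combiner idea, here is a sketch using Text keys and values and the old org.apache.hadoop.mapred API; UniqueCombiner is a hypothetical class name, not something the framework currently provides.

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UniqueCombiner extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  /** For each key, emit every distinct value once and drop byte-for-byte duplicates. */
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Set<Text> seen = new HashSet<Text>();
    while (values.hasNext()) {
      Text value = values.next();
      // The iterator may reuse the Text instance, so copy before storing it.
      if (seen.add(new Text(value))) {
        output.collect(key, value);
      }
    }
  }
}

It would be registered with JobConf.setCombinerClass(UniqueCombiner.class). Note that it holds the distinct values for one key in memory, which is fine for deduplication but would not suit every combiner.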

> When combiners exist, postpone mappers' spills of map output to disk until combiners are unsuccessful.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-363
>                 URL: https://issues.apache.org/jira/browse/HADOOP-363
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Dick King
>            Assignee: Owen O'Malley
>
> When a map/reduce job is set up with a combiner, the mapper tasks each build up an in-heap collection of 100K key/value pairs, and then apply the combiner to sets of pairs with like keys to shrink that collection before spilling it to disk to send to the reducers.
> Typically, running the combiner consumes far fewer resources than shipping the data, especially since the data end up in a reducer where probably the same code will be run anyway.
> I would like to see this changed so that when the combiner shrinks the 100K key/value pairs to less than, say, 90K, we just keep running the mapper and combiner alternately until we get enough distinct keys to make this unlikely to be worthwhile [or until we run out of input, of course].
> This has two costs: the whole internal buffer has to be re-sorted so we can apply the combiner even though as few as 10K new elements have been added, and in some cases we'll call the combiner on many singletons.  
> The first of these costs can be avoided by doing a mini-sort in the new pairs section and doing a merge to develop the combiner sets and the new sorted retained elements section.
> The second of these costs can be avoided by detecting what would otherwise be singleton combiner calls and not making them, which is a good idea in itself even if we don't decide to do this reform.
> The two techniques combine well; recycled elements of the buffer need not be combined if there's no new element with the same key.
> -dk
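
For reference, a rough sketch of the buffering scheme described in the quoted issue above: keep the already-combined records sorted in memory, mini-sort only the newly collected pairs, merge the two sections, and call the combiner only on runs that gained a new pair and contain more than one value. Everything here (DeferredSpillBuffer, Pair, Combine, absorb) is a hypothetical illustration, not the actual MapTask code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DeferredSpillBuffer<K, V> {

  /** Combiner callback: collapse all values for one key into (usually) fewer pairs. */
  public interface Combine<K, V> {
    List<Pair<K, V>> combine(K key, List<V> values);
  }

  /** Simple key/value holder (illustrative only). */
  public static class Pair<K, V> {
    public final K key;
    public final V value;
    public Pair(K key, V value) { this.key = key; this.value = value; }
  }

  private final Comparator<? super K> cmp;
  // Already sorted and already combined; kept in memory instead of being spilled.
  private List<Pair<K, V>> retained = new ArrayList<Pair<K, V>>();

  public DeferredSpillBuffer(Comparator<? super K> cmp) { this.cmp = cmp; }

  /**
   * Absorb a batch of newly collected pairs without spilling: mini-sort only
   * the new section, merge it with the retained section, and run the combiner
   * only on runs that gained at least one new pair and are not singletons.
   */
  public void absorb(List<Pair<K, V>> newPairs, Combine<K, V> combiner) {
    Collections.sort(newPairs, new Comparator<Pair<K, V>>() {
      public int compare(Pair<K, V> a, Pair<K, V> b) { return cmp.compare(a.key, b.key); }
    });
    List<Pair<K, V>> next = new ArrayList<Pair<K, V>>(retained.size() + newPairs.size());
    int r = 0, n = 0;
    while (r < retained.size() || n < newPairs.size()) {
      K key = smallestKey(r, n, newPairs);
      List<V> values = new ArrayList<V>();
      boolean sawNew = false;
      while (r < retained.size() && cmp.compare(retained.get(r).key, key) == 0) {
        values.add(retained.get(r++).value);
      }
      while (n < newPairs.size() && cmp.compare(newPairs.get(n).key, key) == 0) {
        values.add(newPairs.get(n++).value);
        sawNew = true;
      }
      if (sawNew && values.size() > 1) {
        next.addAll(combiner.combine(key, values));          // run gained new values: combine it
      } else {
        for (V v : values) next.add(new Pair<K, V>(key, v)); // untouched or singleton run: pass through
      }
    }
    retained = next;
  }

  private K smallestKey(int r, int n, List<Pair<K, V>> newPairs) {
    if (r >= retained.size()) return newPairs.get(n).key;
    if (n >= newPairs.size()) return retained.get(r).key;
    K a = retained.get(r).key;
    K b = newPairs.get(n).key;
    return cmp.compare(a, b) <= 0 ? a : b;
  }
}

The buffer would then be spilled only once the distinct-key count gets large enough that further combining is unlikely to be worthwhile, as the issue proposes.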

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.