Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2008/03/07 04:54:58 UTC

[jira] Created: (HADOOP-2960) A mapper should use some heuristics to decide whether to run the combiner during spills

A mapper should use some heuristics to decide whether to run the combiner during spills
---------------------------------------------------------------------------------------

                 Key: HADOOP-2960
                 URL: https://issues.apache.org/jira/browse/HADOOP-2960
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Runping Qi



Right now, the combiner, if set, will be called for each spill, no matter whether the combiner can actually reduce the number of values.
The mapper should use some heuristics to decide whether to run the combiner during spills.
One such heuristic is to check the ratio of the number of keys to the number of unique keys in the spill.
The combiner would be called only if that ratio exceeds a certain threshold (say 2).
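A minimal sketch of the proposed heuristic (hypothetical names, not Hadoop's actual MapTask code): since the spill's keys are already sorted by the in-memory sort, unique keys can be counted in one linear pass over adjacent pairs, and the combiner is run only when the duplicate ratio clears the threshold.

```java
// Hypothetical sketch of the proposed heuristic: run the combiner on a
// spill only when the ratio of total keys to unique keys exceeds a
// threshold. Illustrative only; not Hadoop's actual spill path.
public class CombinerHeuristic {
    static final double RATIO_THRESHOLD = 2.0;

    // Assumes the spill's keys are already sorted (as they are after the
    // in-memory sort), so unique keys are counted in one linear pass.
    static boolean shouldRunCombiner(String[] sortedKeys) {
        if (sortedKeys.length == 0) return false;
        int unique = 1;
        for (int i = 1; i < sortedKeys.length; i++) {
            if (!sortedKeys[i].equals(sortedKeys[i - 1])) unique++;
        }
        return (double) sortedKeys.length / unique > RATIO_THRESHOLD;
    }

    public static void main(String[] args) {
        // 6 keys, 2 unique -> ratio 3.0 -> combine
        String[] heavyDupes = {"a", "a", "a", "b", "b", "b"};
        // 5 keys, 4 unique -> ratio 1.25 -> skip the combiner
        String[] mostlyUnique = {"a", "b", "c", "d", "d"};
        System.out.println(shouldRunCombiner(heavyDupes));   // true
        System.out.println(shouldRunCombiner(mostlyUnique)); // false
    }
}
```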



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-2960) A mapper should use some heuristics to decide whether to run the combiner during spills

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576057#action_12576057 ] 

Owen O'Malley commented on HADOOP-2960:
---------------------------------------

Computing the number of unique keys is not free: it costs O(N) comparisons. Even worse, this doesn't scale. Currently, combiners are only applied on the original spill, where your approach could work. However, we plan to apply combiners every time we write to disk during the merge sort. There, you certainly can't count the duplicated keys without a prohibitive cost.

-1

Once we have HADOOP-2399, almost any reduction in the cost of network and disk i/o should be worth the cost of the combiner.

> A mapper should use some heuristics to decide whether to run the combiner during spills
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2960
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>
> Right now, the combiner, if set, will be called for each spill, no matter whether the combiner can actually reduce the number of values.
> The mapper should use some heuristics to decide whether to run the combiner during spills.
> One such heuristic is to check the ratio of the number of keys to the number of unique keys in the spill.
> The combiner would be called only if that ratio exceeds a certain threshold (say 2).



[jira] Commented: (HADOOP-2960) A mapper should use some heuristics to decide whether to run the combiner during spills

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576214#action_12576214 ] 

Runping Qi commented on HADOOP-2960:
------------------------------------


Unique key counting can be done as part of the sort. You don't need extra computation at all.

I don't think the overhead of calling the combiner can be dismissed.
It does not make sense to call it when most keys are unique, which is very common when the number of reducers is large.
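The point about counting coming for free can be sketched as follows: the spill writer already emits records in sorted order, so a unique-key counter can piggyback on the adjacent-key comparisons that pass performs anyway (hypothetical names; not Hadoop's actual spill writer).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: fold unique-key counting into the pass that writes
// a sorted spill, so the count adds no comparisons beyond the ones the
// write loop already makes. Not Hadoop's actual spill code.
public class SpillStats {
    long totalKeys = 0;
    long uniqueKeys = 0;
    private String prevKey = null;

    // Called once per record as the sorted spill is written out.
    void onWrite(String key) {
        totalKeys++;
        if (prevKey == null || !key.equals(prevKey)) uniqueKeys++;
        prevKey = key;
    }

    double dupRatio() {
        return uniqueKeys == 0 ? 0.0 : (double) totalKeys / uniqueKeys;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(List.of("b", "a", "b", "a", "c", "b"));
        Collections.sort(keys);                 // spill keys are sorted first
        SpillStats stats = new SpillStats();
        for (String k : keys) stats.onWrite(k); // counting rides on the write pass
        System.out.println(stats.dupRatio());   // 6 keys / 3 unique = 2.0
    }
}
```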


> A mapper should use some heuristics to decide whether to run the combiner during spills
> ---------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2960
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>
> Right now, the combiner, if set, will be called for each spill, no matter whether the combiner can actually reduce the number of values.
> The mapper should use some heuristics to decide whether to run the combiner during spills.
> One such heuristic is to check the ratio of the number of keys to the number of unique keys in the spill.
> The combiner would be called only if that ratio exceeds a certain threshold (say 2).
