Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2008/06/18 01:02:44 UTC

[jira] Created: (PIG-274) changing combiner behavior past hadoop 18

changing combiner behavior past hadoop 18
-----------------------------------------

                 Key: PIG-274
                 URL: https://issues.apache.org/jira/browse/PIG-274
             Project: Pig
          Issue Type: Bug
            Reporter: Olga Natkovich


In Hadoop 18, the way combiners are handled is changing. The Hadoop team agreed to keep things backward compatible for now but will deprecate the current behavior in the future (likely in Hadoop 19), so Pig needs to adjust to the new behavior. This should be done in the post-2.0 code base.

Old behavior: the combiner is called once and only once per map task.
New behavior: the combiner can be run 0 or more times, on both the map and reduce sides. 0 times happens if only a single <K, V> fits into the sort buffer. Multiple times can happen in the case of a hierarchical merge.

The main issue that causes problems for Pig is that we would not know in advance whether the combiner will run 0, 1, or more times. This causes several issues:

(1) Let's assume that we compute count. If we enable the combiner, the reducer expects to get numbers, not values, as its input. The Hadoop team suggested that we could annotate each tuple with a byte that tells whether it went through the combiner. This could be expensive computationally as well as use extra memory. One thing to notice is that some algebraics (like SUM, MIN, MAX) don't care whether the data was precombined, as they always do the same thing. Perhaps we can make algebraic functions declare whether they care or not. Then we only annotate the ones that need it.
(2) Since the combiner can be called 1 or more times, getInitial and getIntermediate have to do the same thing. So again, we need to change the interface to reflect that.
(3) The current combiner code assumes that it only works with 1 input. When it runs on the reduce side, it can be dealing with tuples from multiple inputs.
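As an illustration of points (1) and (2), here is a minimal sketch of COUNT's three stages; the class and method names are hypothetical and do not reflect Pig's actual Algebraic interface. It shows why the intermediate step must be safe to apply 0, 1, or many times.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not Pig's real Algebraic interface) of COUNT's
// three stages, showing why the combiner stage must tolerate running
// 0, 1, or many times.
public class CountSketch {
    // Initial stage: each raw tuple contributes a count of 1.
    static long initial(Object tuple) {
        return 1L;
    }

    // Intermediate stage (combiner): sums partial counts. Summing sums
    // of counts is still a count, so re-applying this to its own output
    // any number of times stays correct.
    static long intermediate(List<Long> partialCounts) {
        long total = 0;
        for (long c : partialCounts) total += c;
        return total;
    }

    // Final stage (reducer): for COUNT this is the same operation as
    // the intermediate stage.
    static long fin(List<Long> partialCounts) {
        return intermediate(partialCounts);
    }

    public static void main(String[] args) {
        long combined   = intermediate(Arrays.asList(1L, 1L, 1L)); // combiner ran once: 3
        long viaReducer = fin(Arrays.asList(combined));            // reducer over combined output: 3
        long uncombined = fin(Arrays.asList(1L, 1L, 1L));          // combiner ran zero times: 3
        System.out.println(viaReducer == uncombined);              // true: result is the same
    }
}
```

Note that COUNT only works here because its final stage receives pre-counted numbers either way; a function whose final stage treats raw values differently from combined ones is exactly the case that needs the annotation byte.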

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-274) changing combiner behavior past hadoop 18

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613673#action_12613673 ] 

Olga Natkovich commented on PIG-274:
------------------------------------

Some additional thoughts.

A hash-based approach would work well for group-by keys with very low cardinality, since the data given to each map is usually pretty small. For other cases, hashing would actually introduce significant overhead. Since we don't currently have metadata, we will not be able to make an intelligent choice. Moreover, if the combiner is called after the aggregation, it is completely wasted since the data is already reduced.

Alan suggested that we could opportunistically reduce data if we see adjacent keys that are the same, but I am wondering whether the overhead of the compare is justified in this case.
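A minimal sketch of that opportunistic idea, under hypothetical names: while streaming (key, value) pairs, compare each key against the previous one and fold the value in when they match. The cost is one extra compare per tuple, which pays off only when equal keys actually arrive adjacently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of opportunistic adjacent-key combining for a SUM.
// Pairs are encoded as long[]{key, value} for brevity.
public class AdjacentCombine {
    static List<long[]> combineAdjacent(List<long[]> pairs) {
        List<long[]> out = new ArrayList<>();
        for (long[] p : pairs) {
            if (!out.isEmpty() && out.get(out.size() - 1)[0] == p[0]) {
                out.get(out.size() - 1)[1] += p[1]; // same key as previous: fold value in
            } else {
                out.add(new long[]{p[0], p[1]});    // new key: start a fresh pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> in = new ArrayList<>();
        in.add(new long[]{1, 2});
        in.add(new long[]{1, 3});
        in.add(new long[]{2, 4});
        // Runs of equal keys collapse: (1,2),(1,3),(2,4) -> (1,5),(2,4)
        for (long[] p : combineAdjacent(in)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

On random key order this degenerates to a pure copy plus one wasted compare per tuple, which is the overhead question raised above.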

Comments are welcome.



[jira] Commented: (PIG-274) changing combiner behavior past hadoop 18

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12613971#action_12613971 ] 

Pi Song commented on PIG-274:
-----------------------------

Sounds like this hash thing should be implemented in Hadoop instead. The reason it is not there already might be the risk.

If we want to implement it solely in Pig, then this feature should be optional.



[jira] Commented: (PIG-274) changing combiner behavior past hadoop 18

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12614039#action_12614039 ] 

Olga Natkovich commented on PIG-274:
------------------------------------

According to Ben, that is how the combiner used to work in earlier versions of Hadoop :).

I agree that Hadoop seems to be pushing this functionality onto users to implement.

Also, it seems that it is not giving us enough control over when and how to execute code. Ideally, I should be able to say run/don't run the combiner on the map side, and run/don't run the combiner on the reduce side.



[jira] Commented: (PIG-274) changing combiner behavior past hadoop 18

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606447#action_12606447 ] 

Olga Natkovich commented on PIG-274:
------------------------------------

After some discussion with the Hadoop guys, I think what they are proposing makes sense.

Pig already treats algebraic computation as having three stages: initial, intermediate, and final. The proposal is for the stages to map to the MR framework as follows:

initial - done in the map
intermediate - done in the combiner; it is fine that this computation runs 0 or more times, since that does not impact the correctness of the result
final - done in the reducer.

We can make the computation quite efficient in the map for a common case by performing hash-based aggregation. This way, the data to be sorted by the map would be significantly smaller than it is now.
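A minimal sketch of that hash-based map-side aggregation, for a SUM; the class and method names are hypothetical, not Pig internals. Instead of emitting every <key, value> pair into the sort buffer, the map task folds values into a small hash table and emits one pre-aggregated pair per key at the end.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of hash-based map-side aggregation for a SUM.
// This wins when key cardinality is low (the table stays small and each
// key is hit often) and is pure overhead when keys are mostly unique.
public class HashAggSketch {
    private final Map<String, Long> partials = new HashMap<>();

    // Called once per input tuple instead of writing to the sort buffer.
    void collect(String key, long value) {
        partials.merge(key, value, Long::sum);
    }

    // At the end of the map task, emit one pre-aggregated pair per key.
    Map<String, Long> flush() {
        return partials;
    }

    public static void main(String[] args) {
        HashAggSketch agg = new HashAggSketch();
        agg.collect("a", 2L);
        agg.collect("b", 5L);
        agg.collect("a", 3L);
        // Three input pairs shrink to two: a sums to 5, b stays 5.
        agg.flush().forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A real implementation would also need to bound the table's memory and spill or flush early when it grows too large.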

Pradeep will be running some performance tests to confirm this.

Please comment.



[jira] Commented: (PIG-274) changing combiner behavior past hadoop 18

Posted by "Pi Song (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607610#action_12607610 ] 

Pi Song commented on PIG-274:
-----------------------------

Sounds reasonable. Please note that the contract when implementing "Interim" will have to allow multi-level combining from now on.
All the algebraic functions that we've got should be alright. The new covariance thing should be fine as well.
