You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by "Dmitriy V. Ryaboy (JIRA)" <ji...@apache.org> on 2012/08/23 00:12:42 UTC

[jira] [Commented] (PIG-2888) Improve performance of POPartialAgg

    [ https://issues.apache.org/jira/browse/PIG-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439904#comment-13439904 ] 

Dmitriy V. Ryaboy commented on PIG-2888:
----------------------------------------

The current implementation makes a two key assumptions that are frequently violated in real-life datasets and scripts:

1) The intermediate UDF is cheap to invoke
2) Records come in mostly-grouped order (records with the same key tend to follow each other).

When condition 2 is not satisfied, POPartialAgg winds up calling the intermediate UDF on all accumulated values so far for a given key, plus a new tuple, for every single tuple it sees. This causes a significant performance degradation.

Instead, we propose accumulating tuples across the board until a memory threshold is reached. Once this threshold is reached, all keys and tuples are fed into the intermediate UDF and the results put into a second-level map (presumably, having been significantly shrunk by the intermediate UDF).  This repeats until the second-level map hits its threshold, at which point *it* is summarized and its values replaced with the aggregated ones. If after such a reduction the memory occupied by the hashmap is still near the threshold, the results are returned to the regular MR pipeline.
                
> Improve performance of POPartialAgg
> -----------------------------------
>
>                 Key: PIG-2888
>                 URL: https://issues.apache.org/jira/browse/PIG-2888
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Dmitriy V. Ryaboy
>            Assignee: Dmitriy V. Ryaboy
>
> During performance testing, we found that POPartialAgg can cause performance degradation for Pig jobs when the Algebraic UDFs it's being applied to aren't well suited to the operator's assumptions. Changing the implementation to a more flexible hash-based model can provide significant performance improvements.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira