Posted to dev@pig.apache.org by "Thejas M Nair (JIRA)" <ji...@apache.org> on 2012/07/27 03:04:34 UTC

[jira] [Commented] (PIG-2829) Use partial aggregation more aggresively

    [ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423602#comment-13423602 ] 

Thejas M Nair commented on PIG-2829:
------------------------------------

I will review the patch soon. Some comments regarding the default configuration - 

bq. 2: changes existing default values: 
After thinking about the multi-query use case, where a single map task can have multiple POPartialAgg operators, I am having second thoughts about turning partial agg on by default. Can you try these settings on queries where around 10 or more group+agg operations get combined into a single MR job? Maybe we should address the potential OOM issues for this use case before we change the defaults. This is likely to become a bigger issue when we use 100k records to decide whether to turn partial aggregation on or off.

bq. 3: adds a property pig.exec.mapPartAgg.reduction.checkinterval, which defaults to 100k, so after processing every 100k records mapagg will check the reduction rate to see if it should be disabled. Previously we only looked at the first 1000 records.
Can you run some benchmarks to see whether the delay in turning mapPartAgg off makes any noticeable difference in runtime?
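
To make the check-interval trade-off concrete, here is a minimal Java sketch of a periodic reduction check. It is not the POPartialAgg code: the class and method names are hypothetical, it counts records and distinct keys rather than data size, and it assumes the hash table is never flushed. The two constructor arguments stand in for the proposed check interval (100k in the patch) and the minimum reduction factor (10x by default).

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of an in-map aggregator that disables itself
 * when the observed reduction is too low. Not Pig's implementation.
 */
final class ReductionCheckSketch {
    private final long checkInterval;   // records between checks (assumed analogue of the proposed checkinterval)
    private final double minReduction;  // required ratio of input records to distinct keys
    private final Set<String> keys = new HashSet<>(); // stands in for the aggregation hash table
    private long recordsSeen = 0;
    private boolean enabled = true;

    ReductionCheckSketch(long checkInterval, double minReduction) {
        this.checkInterval = checkInterval;
        this.minReduction = minReduction;
    }

    /** Feed one record's group key; runs the reduction check every checkInterval records. */
    void offer(String key) {
        if (!enabled) {
            return; // a real operator would pass the record through unaggregated
        }
        keys.add(key);
        recordsSeen++;
        if (recordsSeen % checkInterval == 0) {
            // Reduction so far: input records per distinct key held in memory.
            double reduction = (double) recordsSeen / keys.size();
            if (reduction < minReduction) {
                enabled = false;
                keys.clear(); // release the memory once aggregation is switched off
            }
        }
    }

    boolean isEnabled() {
        return enabled;
    }
}
```

In this sketch, a mapper whose keys are nearly all distinct keeps filling the table for a full interval before the first check fires, so a larger interval means more hashing work and more memory held before aggregation is switched off. That is the cost the benchmark above should measure.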
                
> Use partial aggregation more aggresively
> ----------------------------------------
>
>                 Key: PIG-2829
>                 URL: https://issues.apache.org/jira/browse/PIG-2829
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Jie Li
>         Attachments: 2829.1.patch, 2829.separate.options.patch, pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature in Pig 0.10 that performs aggregation within the map function. Its main advantage over the combiner is that it avoids de/serializing and sorting the data, and it can automatically disable itself if the data reduction rate is low. Currently it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to be revisited:
> 1. The threshold for auto-disabling. Currently each mapper looks at the first 1k (hard-coded) records to see if there's enough data size reduction (defaults to 10x, configurable). The check happens earlier if the hash table gets full before 1k records are processed (hash table size is controlled by pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently PartialAgg won't work without a combiner following it, so we need to provide separate options to enable each independently.
