You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Jie Li (JIRA)" <ji...@apache.org> on 2012/07/20 01:26:33 UTC
[jira] [Updated] (PIG-2829) Use partial aggregation more aggresively

     [ https://issues.apache.org/jira/browse/PIG-2829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Li updated PIG-2829:
------------------------

    Attachment: tpch-10G.png
                pigmix-10G.png

Attached some benchmark results of a group-by aggregation on different dataset (TPC-H and Pigmix) with different selectivity and with combiner/PartialAgg turned on/off respectively. 

'none' means not using combiner or PartialAgg. 'combiner' means only using the combiner. 'hash' means only using the PartialAgg (with some hack). 'hash+combiner' means enabling PartialAgg. For the latter two we configures the minimum reduction to 1 so PartialAgg is never auto-disabled (otherwise it'd be auto-disabled in all cases currently). For TPC-H We also run Hive with default settings for reference.

The titles above each chart show the number of input records and output records of each query, for example, "60M -> 4 reduction" means there are 60 million input records and four output records. For both dataset we use 10GB data, and ran on a single machine, which is ok here as we are comparing PartialAgg with combiner so the network doesn't matter much here. 

>From the results we can observe:

1) PartialAgg is more efficient than the combiner, which is as expected and should be leveraged;
2) the combiner is unnecessary when PartialAgg is used;
3) the PartialAgg/combiner overhead can be significant if the data reduction rate is low.

                
> Use partial aggregation more aggresively
> ----------------------------------------
>
>                 Key: PIG-2829
>                 URL: https://issues.apache.org/jira/browse/PIG-2829
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.10.0
>            Reporter: Jie Li
>         Attachments: pigmix-10G.png, tpch-10G.png
>
>
> Partial aggregation (Hash Aggregation, aka in-map combiner) is a new feature in Pig 0.10 that will perform aggregation within map function. The main advantage against combiner is it avoids de/serializing and sorting the data, and it can auto disable itself if the data reduction rate is low. Currently it's disabled by default.
> To leverage the power of PartialAgg more aggressively, several things need to be revisited:
> 1. The threshold of auto-disabling. Currently each mapper looks at first 1k (hard-coded) records to see if there's enough data size reduction (defaults to 10x, configurable). The check would happen earlier if the hash table gets full before processing the 1k records (hash table size is controlled by pig.cachedbag.memusage). We might want to relax these thresholds.
> 2. Dependency on the combiner. Currently the PartialAgg won't work without a combiner following it, so we need to provide separate options to enable each independently. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira