Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2009/10/30 22:03:59 UTC

[jira] Commented: (PIG-1038) Optimize nested distinct/sort to use secondary key

    [ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772085#action_12772085 ] 

Daniel Dai commented on PIG-1038:
---------------------------------

Here is the design for this optimization:
1. Add SecondaryKeyOptimizer, which optimizes the map-reduce plan. It will
1.1 Discover whether a nested foreach plan uses sort/distinct.
1.2 For the first such sort/distinct, use the sort/distinct key as the secondary key.
1.3 Once SecondaryKeyOptimizer discovers a secondary key, it will call POLocalRearrange.setSecondaryPlan, then drop the sort or simplify the distinct.

2. Change POLocalRearrange
2.1 Add setSecondaryPlan so that SecondaryKeyOptimizer can set the secondary plan.
2.2 Change constructLROutput to build a compound key, which is a tuple: (key, secondaryKey).
2.3 Duplicate the logic that strips the key from the values so that it applies to the secondary key as well.
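
Steps 2.2 and 2.3 can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the actual POLocalRearrange code: the class CompoundKeySketch and the methods buildCompoundKey and stripKeys are hypothetical names, tuples are modeled as plain List<Object>, and both keys are assumed to be single columns.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; these names are illustrative and are not Pig APIs.
class CompoundKeySketch {
    // Build the compound key (key, secondaryKey) from one input record.
    static List<Object> buildCompoundKey(List<Object> record, int keyCol, int secondaryCol) {
        return Arrays.asList(record.get(keyCol), record.get(secondaryCol));
    }

    // Strip the columns already carried in the compound key from the value,
    // so they are not shipped twice between map and reduce.
    static List<Object> stripKeys(List<Object> record, int keyCol, int secondaryCol) {
        List<Object> value = new ArrayList<>();
        for (int i = 0; i < record.size(); i++) {
            if (i != keyCol && i != secondaryCol) {
                value.add(record.get(i));
            }
        }
        return value;
    }
}
```

The stripped columns have to be stitched back on the reduce side, which is what steps 3 and 4 are for.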

3. Change POPackageAnnotator to patch POPackage with the keyinfo for both the key and the secondaryKey

4. Change POPackage to stitch the secondary key back into the value

5. Change MapReduceOper to indicate that the map-reduce operator needs a secondary key; JobControlCompiler will then set OutputValueGroupingComparator to use the mainKeyComparator

6. Add mainKeyComparator, which inherits from PigNullableWritable and compares only the main key. We need it for the OutputValueGroupingComparator
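
To illustrate why steps 5 and 6 need a separate comparator, here is a small stand-alone simulation. It does not use Hadoop or Pig classes: SecondarySortSketch, FULL_KEY, MAIN_KEY and group are hypothetical names, and a compound key is modeled as an int[] of {mainKey, secondaryKey}. Records are sorted on the full compound key, but a new group starts only when the main key changes, so each group sees its secondary keys already in order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation of secondary sort; not Hadoop or Pig code.
class SecondarySortSketch {
    // Sort order: main key first, then secondary key.
    static final Comparator<int[]> FULL_KEY =
        Comparator.<int[]>comparingInt(k -> k[0]).thenComparingInt(k -> k[1]);

    // Grouping order: main key only, playing the role of mainKeyComparator.
    static final Comparator<int[]> MAIN_KEY = Comparator.comparingInt(k -> k[0]);

    // Sort the compound keys, then cut a new group whenever the main key changes.
    static List<List<int[]>> group(List<int[]> keys) {
        List<int[]> sorted = new ArrayList<>(keys);
        sorted.sort(FULL_KEY);
        List<List<int[]>> groups = new ArrayList<>();
        int[] prev = null;
        for (int[] k : sorted) {
            if (prev == null || MAIN_KEY.compare(prev, k) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(k);
            prev = k;
        }
        return groups;
    }
}
```

If grouping used FULL_KEY instead of MAIN_KEY, every distinct (key, secondaryKey) pair would become its own group, which is not what the nested foreach expects.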

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
> If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.
> Eg1:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = order A by $1;
>     generate group, D;
> }
> store C into 'myresult';
> We can specify a secondary sort on A.$1, and drop "order A by $1".
> Eg2:
> A = load 'mydata';
> B = group A by $0;
> C = foreach B {
>     D = A.$1;
>     E = distinct D;
>     generate group, E;
> }
> store C into 'myresult';
> We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a special version of distinct, which does not do the sorting.
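
The "special version of distinct" in Eg2 could look roughly like the sketch below. It is an assumption about the idea, not Pig's implementation: SortedDistinctSketch and distinctSorted are hypothetical names. Because the values of a group already arrive ordered by the secondary key, duplicates are adjacent, and one pass that compares each value with the previous one is enough.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical sketch; not the Pig distinct operator.
class SortedDistinctSketch {
    // Remove duplicates from an already sorted stream without sorting again.
    static <T> List<T> distinctSorted(Iterable<T> sortedValues) {
        List<T> out = new ArrayList<>();
        boolean first = true;
        T prev = null;
        for (T v : sortedValues) {
            // Keep a value only if it differs from its predecessor.
            if (first || !Objects.equals(prev, v)) {
                out.add(v);
            }
            prev = v;
            first = false;
        }
        return out;
    }
}
```

This only needs to remember the previous value, whereas a general distinct has to sort or hash the whole bag.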
