Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2009/10/30 22:01:59 UTC

[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key

     [ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1038:
----------------------------

          Component/s: impl
          Description: 
If a nested foreach plan contains a sort or a distinct, Pig can use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.

Example 1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';

We can specify A.$1 as the secondary sort key and drop "order A by $1", because the tuples of each group then already reach the reducer sorted on $1.
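
To illustrate why the nested order becomes redundant, here is a minimal Python sketch that simulates the shuffle. It is not Pig or Hadoop code; the function name shuffle_with_secondary_sort and the sample records are made up for the illustration. Records are sorted on the composite key ($0, $1) but grouped on $0 only, so each group's bag comes out already ordered by $1.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Simulate a shuffle that sorts on ($0, $1) but groups on $0 only.

    Hypothetical helper for illustration; it is not part of Pig or Hadoop.
    """
    # Sort on the composite key: primary key $0, secondary key $1.
    ordered = sorted(records, key=itemgetter(0, 1))
    # Group on the primary key alone, as the reducer would see it.
    for group, tuples in groupby(ordered, key=itemgetter(0)):
        # The tuples of each group are already ordered by $1,
        # so no per-group sort (SortedDataBag) is needed here.
        yield group, list(tuples)


# Made-up sample data standing in for 'mydata'.
records = [("b", 2), ("a", 3), ("a", 1), ("b", 1), ("a", 2)]
result = list(shuffle_with_secondary_sort(records))
```

With this input, result holds group "a" with its tuples in the order 1, 2, 3, followed by group "b" with its tuples in the order 1, 2, without any sort inside the per-group loop.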

Example 2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';

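We can specify A.$1 as the secondary sort key and simplify "D=A.$1; E=distinct D" to a special version of distinct that does no sorting of its own. Because equal values arrive next to each other, it only has to skip consecutive duplicates and never needs to see all distinct values at once.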
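
A minimal Python sketch of what such a non-sorting distinct could look like, assuming its input is already sorted by the secondary key. The name distinct_sorted is hypothetical and this is not the actual Pig operator. It keeps only the previous value in memory instead of collecting everything into a bag.

```python
def distinct_sorted(values):
    """Yield each value once, assuming equal values arrive adjacent.

    Hypothetical helper for illustration; it relies on the input being
    sorted already (for example by a secondary sort key).
    """
    previous = object()  # sentinel that compares unequal to any real value
    for value in values:
        if value != previous:
            # First occurrence of a new value in the sorted stream.
            yield value
            previous = value


# Values of $1 for one group, as they would arrive after the secondary sort.
sorted_values = [1, 1, 2, 3, 3, 3]
unique_values = list(distinct_sorted(sorted_values))
```

Here unique_values is [1, 2, 3]. On unsorted input the function would only remove adjacent repeats, which is why the secondary sort key is a precondition.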

  was: Since the data coming to the reducer is sorted on group+distinct, we don't need to see all distinct values at once

    Affects Version/s: 0.4.0
        Fix Version/s: 0.6.0
              Summary: Optimize nested distinct/sort to use secondary key  (was: stream nested distinct for in case of accumulate interface)

> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
>                 Key: PIG-1038
>                 URL: https://issues.apache.org/jira/browse/PIG-1038
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.4.0
>            Reporter: Olga Natkovich
>            Assignee: Daniel Dai
>             Fix For: 0.6.0
>
>
