Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2009/10/30 22:01:59 UTC
[jira] Updated: (PIG-1038) Optimize nested distinct/sort to use secondary key
[ https://issues.apache.org/jira/browse/PIG-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-1038:
----------------------------
Component/s: impl
Description:
If a nested foreach plan contains sort/distinct, it is possible to use Hadoop secondary sort instead of SortedDataBag and DistinctDataBag to optimize the query.
Eg1:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = order A by $1;
    generate group, D;
}
store C into 'myresult';
We can specify a secondary sort on A.$1, and drop "order A by $1".
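A rough way to see why the nested order becomes redundant: if the shuffle sorts records on the composite key ($0, $1), every group's values reach the reducer already ordered on $1. The Python sketch below only simulates that behaviour in memory. The function names and sample rows are hypothetical, and it is not meant to describe how Pig or Hadoop implement it.

```python
from itertools import groupby
from operator import itemgetter


def group_with_nested_sort(records):
    """Baseline: group on field 0, then sort each bag on field 1,
    which is what "order A by $1" inside the foreach asks for."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[0], []).append(rec)
    return [(key, sorted(bag, key=itemgetter(1)))
            for key, bag in sorted(groups.items())]


def group_with_secondary_sort(records):
    """Simulated secondary sort: a single sort on (field 0, field 1)
    stands in for the shuffle; groups are then streamed out with no
    per-group sort."""
    shuffled = sorted(records, key=itemgetter(0, 1))
    return [(key, list(bag))
            for key, bag in groupby(shuffled, key=itemgetter(0))]


# Hypothetical sample data: (group key, secondary key)
rows = [("a", 3), ("b", 1), ("a", 1), ("b", 2), ("a", 2)]
baseline = group_with_nested_sort(rows)
optimized = group_with_secondary_sort(rows)
```

Both functions produce the same bags for this input. The second one never holds a whole group in a sorted container, which is the saving the description is after.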
Eg2:
A = load 'mydata';
B = group A by $0;
C = foreach B {
    D = A.$1;
    E = distinct D;
    generate group, E;
}
store C into 'myresult';
We can specify a secondary sort key on A.$1, and simplify "D=A.$1; E=distinct D" to a special version of distinct, which does not do the sorting.
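The "special version of distinct" can be pictured as a streaming pass: once values within a group arrive sorted on $1, duplicates are adjacent, so comparing each value with the previous one is enough. The following Python sketch simulates this under that assumption. The names distinct_sorted and group_with_streaming_distinct are hypothetical and do not correspond to Pig classes.

```python
from itertools import groupby
from operator import itemgetter


def distinct_sorted(values):
    """Yield each value once, assuming equal values are adjacent
    (true when the input is already sorted)."""
    first = True
    previous = None
    for value in values:
        if first or value != previous:
            yield value
        previous = value
        first = False


def group_with_streaming_distinct(records):
    """Simulate the shuffle with a sort on (field 0, field 1), then
    project field 1 and de-duplicate it in one pass per group."""
    shuffled = sorted(records, key=itemgetter(0, 1))
    return [
        (key, list(distinct_sorted(rec[1] for rec in bag)))
        for key, bag in groupby(shuffled, key=itemgetter(0))
    ]


# Hypothetical sample data: (group key, value to de-duplicate)
rows = [("a", 2), ("b", 1), ("a", 1), ("a", 2), ("b", 1)]
result = group_with_streaming_distinct(rows)
```

The de-duplication step keeps only one previous value in memory, so it needs neither a sort nor a set of everything seen so far within the group.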
was:Since the data coming to the reducer is sorted on group+distinct, we don't need to see all distinct values at once
Affects Version/s: 0.4.0
Fix Version/s: 0.6.0
Summary: Optimize nested distinct/sort to use secondary key (was: stream nested distinct for in case of accumulate interface)
> Optimize nested distinct/sort to use secondary key
> --------------------------------------------------
>
> Key: PIG-1038
> URL: https://issues.apache.org/jira/browse/PIG-1038
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.4.0
> Reporter: Olga Natkovich
> Assignee: Daniel Dai
> Fix For: 0.6.0