Posted to dev@pig.apache.org by "Daniel Dai (JIRA)" <ji...@apache.org> on 2017/04/04 07:53:41 UTC

[jira] [Commented] (PIG-5211) Optimize Nested Limited Sort

    [ https://issues.apache.org/jira/browse/PIG-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954743#comment-15954743 ] 

Daniel Dai commented on PIG-5211:
---------------------------------

Looks pretty good so far. NestedLimitOptimizer needs some fine-tuning: the existence of both LOLimit and LOSort is not enough; you must make sure LOLimit comes right after LOSort. Alternatively, you can follow LimitOptimizer and push LOLimit all the way up, which is more sophisticated (I am not insisting on this, though). Also, SecondaryKeyOptimizer currently does not recognize a limited nested sort, so it may turn the limited sort into an MR/Tez secondary sort, in which case the limit is lost. So, inside SecondaryKeyOptimizer, we should skip the optimization when the nested sort is a limited sort. You can use the following script as a test case in which SecondaryKeyOptimizer gets involved:
{code}
a = load 'studenttab10k' as (name:chararray, age:int, gpa:double);
b = group a by name;
c = foreach b {
    c1 = order a by age;
    c2 = limit c1 5;
    generate c2;
}
explain c;
{code}
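
To illustrate the adjacency point above, here is a toy sketch in Java. It is not Pig code: the nested plan is modeled as a plain list of operator names, and the class and method names (NestedLimitCheck, containsBoth, limitDirectlyAfterSort) are made up for this example. It only shows why "both operators exist" is a weaker condition than "LOLimit is the immediate successor of LOSort".

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: a nested plan is represented as an ordered list of operator
// names. Pig's real plans are operator graphs, not lists of strings.
final class NestedLimitCheck {

    // Weak check: both operators appear somewhere in the nested plan.
    static boolean containsBoth(List<String> chain) {
        return chain.contains("LOSort") && chain.contains("LOLimit");
    }

    // Stricter check: some LOLimit is the immediate successor of an LOSort.
    static boolean limitDirectlyAfterSort(List<String> chain) {
        for (int i = 0; i + 1 < chain.size(); i++) {
            if (chain.get(i).equals("LOSort") && chain.get(i + 1).equals("LOLimit")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // order ... ; limit ... : safe to fuse into a limited sort.
        List<String> adjacent = Arrays.asList("LOInnerLoad", "LOSort", "LOLimit");
        // order ... ; filter ... ; limit ... : the limit applies to the filtered
        // rows, so fusing sort and limit would change the result.
        List<String> separated = Arrays.asList("LOInnerLoad", "LOSort", "LOFilter", "LOLimit");

        System.out.println(containsBoth(separated));           // true
        System.out.println(limitDirectlyAfterSort(separated)); // false
        System.out.println(limitDirectlyAfterSort(adjacent));  // true
    }
}
```

In the real optimizer the same question would be asked of the plan's successor/predecessor relation rather than of list indices.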

> Optimize Nested Limited Sort
> ----------------------------
>
>                 Key: PIG-5211
>                 URL: https://issues.apache.org/jira/browse/PIG-5211
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Jin Sun
>            Assignee: Jin Sun
>             Fix For: 0.17.0
>
>         Attachments: PIG-5211-1.patch
>
>
> Currently, in a FOREACH clause, if both LIMIT and ORDER BY are present, Pig stores all elements and sorts them. It should use a priority queue to be more space-efficient.
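
As a rough sketch of the priority-queue idea in the description (not the code in the attached patch; the class and method names BoundedTopK and smallestK are hypothetical), a bounded max-heap can keep only the k smallest elements while streaming through a bag, so memory stays at O(k) instead of O(n):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Pig: returns the first k elements of the
// input under the given ordering without holding the whole input in memory.
final class BoundedTopK {

    static <T> List<T> smallestK(Iterable<T> input, int k, Comparator<? super T> order) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap under `order`: the head is the largest of the kept elements,
        // i.e. the one to evict when a smaller element arrives.
        PriorityQueue<T> heap = new PriorityQueue<>(k, (a, b) -> order.compare(b, a));
        for (T item : input) {
            if (heap.size() < k) {
                heap.add(item);
            } else if (order.compare(item, heap.peek()) < 0) {
                heap.poll();
                heap.add(item);
            }
        }
        // The heap's iteration order is unspecified, so sort the k survivors.
        List<T> result = new ArrayList<>(heap);
        result.sort(order);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(23, 19, 31, 18, 27, 20, 45);
        // Mirrors "order a by age; limit c1 5;" for a single group.
        System.out.println(smallestK(ages, 5, Comparator.<Integer>naturalOrder()));
        // prints [18, 19, 20, 23, 27]
    }
}
```

Each element costs at most O(log k) heap work, and only the final k survivors are sorted.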



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)