You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Alex Bain (JIRA)" <ji...@apache.org> on 2014/01/14 02:02:14 UTC

[jira] [Commented] (PIG-3557) Implement optimizations for LIMIT

    [ https://issues.apache.org/jira/browse/PIG-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870217#comment-13870217 ] 

Alex Bain commented on PIG-3557:
--------------------------------

1. If the previous stage using 1 reduce, no need to add one more vertex

I think in particular this should be that if the cardinality of the previous Tez vertex (i.e. the number of run-time tasks generated by the previous vertex) equals 1, then there is no more need to add another vertex. Does this still sound like an optimization that we should still implement, and if so, how would we check such a thing?

2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible

I'm reading this to mean that a sort isn't necessary on the edge. Daniel - the way you wrote this, it sounds like you want to set the edge to be a broadcast edge rather than a scatter-gather or one-to-one edge. However, since the parallelism of the the final vertex that implements LIMIT is one, I don't think any of these really make a difference (correct me if I am wrong). Thus, I'm reading this to mean that we don't need to do any sorting on the edge (LIMIT explicitly says that it might return any order).

For my patch, I am changing the input / output types to be OnFileUnorderedKVOutput / ShuffledUnorderedKVInput and leaving the edge type to SCATTER_GATHER. However, I have these lines commented out with a note that they should be uncommented after TEZ-661. Does this sound like the right thing to do?

3. Similar to PIG-1270, we can push limit to InputHandler

This is done by the LimitOptimizer already, I have gone through and verified it.

4. We also need to think through the "limited order by" case once "order by" is implemented

There is quite a bit of code to handle LIMIT in TezCompiler::getSortJobs. Is this code already sufficient?

> Implement optimizations for LIMIT
> ---------------------------------
>
>                 Key: PIG-3557
>                 URL: https://issues.apache.org/jira/browse/PIG-3557
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>    Affects Versions: tez-branch
>            Reporter: Alex Bain
>            Assignee: Alex Bain
>
> Implement optimizations for LIMIT when other parts of Pig-on-Tez are more mature. Some of the optimizations mentioned by Daniel include:
> 1. If the previous stage using 1 reduce, no need to add one more vertex
> 2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible
> 3. Similar to PIG-1270, we can push limit to InputHandler
> 4. We also need to think through the "limited order by" case once "order by" is implemented



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)