Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/09/14 10:19:44 UTC

[jira] Issue Comment Edited: (HADOOP-4086) Add limit to Hive QL

    [ https://issues.apache.org/jira/browse/HADOOP-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630855#action_12630855 ] 

jsensarma edited comment on HADOOP-4086 at 9/14/08 1:18 AM:
--------------------------------------------------------------------

some questions:

- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in the dumbest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case where the prior output is sorted/grouped - we need a top-N operator as the limit - one that merges the prior output and returns the top N rows.

based on last 2 observations - i find it much easier to understand the limit operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a merge of all the table's files and produce the top N
  - if the table/data is not sorted/grouped - then the limit task needs to get any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.
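
a minimal sketch of the two cases above, written in Python for illustration rather than as Hive operator code. the names limit_unsorted and limit_sorted are hypothetical, each "file" is modeled as a plain iterable of rows, and the sorted case assumes every file is already sorted ascending on the same key:

```python
import heapq
from itertools import chain, islice


def limit_unsorted(files, n):
    """Unsorted case: return any n rows.

    Files are scanned one at a time, and scanning stops as soon as
    n rows have been collected, so later files may never be opened.
    """
    return list(islice(chain.from_iterable(files), n))


def limit_sorted(files, n, key=None):
    """Sorted case: merge the already-sorted files and keep the first n rows.

    Assumes each file is sorted ascending by `key` (hypothetical parameter).
    heapq.merge streams the inputs, so only the rows needed are consumed,
    plus the current head row of each file.
    """
    return list(islice(heapq.merge(*files, key=key), n))
```

both functions are a single sequential pass over their inputs, which matches the point that the limit operator does not need parallelism of its own.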

the limit operator can run in a single mapper map-only hadoop job in case it's writing to a file - or if it's writing to console (select * limit N) - can just run from the client side. this is orthogonal to what it does.





      was (Author: jsensarma):
    some questions:

- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in the dumbest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case where the output is sorted/grouped - we need to have N from each mapper and then limit N in reducer (standard top N operator)

based on last 2 observations - i find it much easier to understand the limit operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
  - if the table/data is sorted/grouped - then the limit operator needs to do a merge of all the table's files and produce top N
  - if the table/data is not sorted/grouped - then the limit task needs to get any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.

the limit task can run in a single mapper map-only hadoop job in case it's writing to a file - or if it's writing to console (select * limit N) - can just run from the client side. this is orthogonal to what it does.




  
> Add limit to Hive QL
> --------------------
>
>                 Key: HADOOP-4086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4086
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/hive
>            Reporter: Ashish Thusoo
>            Assignee: Ashish Thusoo
>
> Add a limit feature to the Hive Query language.
> so you can do the following things:
> SELECT * FROM T LIMIT 10;
> and this would just return the 10 rows.
> No guarantees are made on which 10 rows are returned by the query.
