Posted to common-dev@hadoop.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2008/09/14 10:19:44 UTC
[jira] Issue Comment Edited: (HADOOP-4086) Add limit to Hive QL
[ https://issues.apache.org/jira/browse/HADOOP-4086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630855#action_12630855 ]
jsensarma edited comment on HADOOP-4086 at 9/14/08 1:18 AM:
--------------------------------------------------------------------
some questions:
- the extra reducesink (in the limitmap -> reducesink -> linkreduce chain) - what will it reduce on?
- in many cases the limit does not seem to need a reduce. for example, in the simplest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case, where the prior output is sorted/grouped, we need a top-N operator as the limit - one that merges the prior output and returns the top N.
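the no-reduce concatenation idea in the simplest case could be sketched like this (a hypothetical python sketch, not hive code; mapper outputs are assumed to be in-memory row lists purely for illustration):

```python
def limit_concat(mapper_outputs, n):
    """Sketch of "select * limit N" without a reduce: concatenate
    mapper outputs, stopping as soon as n rows have been collected.

    mapper_outputs: a list of row lists, one per mapper (illustrative).
    """
    rows = []
    for output in mapper_outputs:
        for row in output:
            rows.append(row)
            if len(rows) == n:
                return rows  # any N rows satisfy the query
    return rows  # fewer than n rows exist in total
```

note that which N rows come back depends on the order the outputs are scanned, which matches the "no guarantees" wording in the issue description.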
based on the last 2 observations, i find it much easier to understand the limit operator implementation as:
- a simple select *-like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
- if the table/data is sorted/grouped, the limit operator needs to do a merge of all the table's files and produce the top N
- if the table/data is not sorted/grouped, the limit task needs to get any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.
the limit operator can run in a single-mapper map-only hadoop job if it's writing to a file - or, if it's writing to the console (select * limit N), it can just run from the client side. this is orthogonal to what it does.
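the sorted/grouped case above amounts to a k-way merge that stops after N rows. a minimal sketch of that (illustrative names, standard-library merge, not hive's actual implementation):

```python
import heapq
from itertools import islice

def limit_top_n(sorted_runs, n):
    """Merge already-sorted runs (e.g. per-mapper sorted outputs) and
    keep only the first n rows of the merged order - i.e. the top N."""
    # heapq.merge lazily merges the sorted inputs; islice stops after n,
    # so only the prefix of the merge is ever materialized.
    return list(islice(heapq.merge(*sorted_runs), n))
```

because the merge is lazy, the job reads only as much of each run as is needed to emit N rows, which is why a single sequential task suffices here.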
was (Author: jsensarma):
some questions:
- The extra reducesink (in the limitmap -> reducesink -> linkreduce) - what will it reduce on?
- in many cases - the limit does not seem to need a reduce. for example - in the dumbest case - select * limit N - we just need to run the mappers and then keep concatenating mapper outputs until we have N rows.
- in the other case, where the output is sorted/grouped, we need N from each mapper and then limit N in the reducer (a standard top-N operator)
based on last 2 observations - i find it much easier to understand the limit operator implementation as:
- a simple select * like operator on a dataset (a table - whether it's an intermediate dataset or not)
- there are two cases:
- if the table/data is sorted/grouped - then the limit operator needs to do a merge of all the table's files and produce the top N
- if the table/data is not sorted/grouped - then the limit task needs to get any N rows - possibly by scanning one file at a time
the limit operator is sequential by definition.
the limit task can run in a single-mapper map-only hadoop job if it's writing to a file - or, if it's writing to the console (select * limit N), it can just run from the client side. this is orthogonal to what it does.
> Add limit to Hive QL
> --------------------
>
> Key: HADOOP-4086
> URL: https://issues.apache.org/jira/browse/HADOOP-4086
> Project: Hadoop Core
> Issue Type: New Feature
> Components: contrib/hive
> Reporter: Ashish Thusoo
> Assignee: Ashish Thusoo
>
> Add a limit feature to the Hive Query language.
> so you can do the following things:
> SELECT * FROM T LIMIT 10;
> and this would return just 10 rows.
> No guarantees are made on which 10 rows are returned by the query.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.