Posted to dev@hive.apache.org by "E. Sammer (JIRA)" <ji...@apache.org> on 2010/01/19 05:44:55 UTC

[jira] Commented: (HIVE-887) Allow SELECT without a mapreduce job

    [ https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802089#action_12802089 ] 

E. Sammer commented on HIVE-887:
--------------------------------

I would also like this kind of functionality. I would add WHERE clause support to the request, though, as there are cases where you know a table will be small. What would be really ideal is to be able to define a threshold on the projected row count: if the query execution engine thinks there may be more rows than the threshold, it resorts to a MR job, but if fewer, it performs a client-side fetch and filter. The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER and related operations would always cause a MR job.

Ex:

SELECT a, b FROM t WHERE c = 'foo' FETCH n;

where n is the upper limit on the projected number of rows for which a client-side fetch should be done. If Hive cannot yet project row counts (I haven't looked at the internals), maybe FETCH n acts like a fetch + limit operation. Alternatively, n could simply be a global configuration parameter, although that seems too inflexible.
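
To make the proposed behaviour concrete, here is a minimal sketch of the decision in Python. It is an illustration only, not Hive code: choose_plan, projected_rows and fetch_threshold are hypothetical names, and the keyword check is a naive stand-in for real query analysis (it would misfire on a string literal containing one of the keywords).

```python
import re

# Clauses that, per the proposal above, should always force a MapReduce job.
# Naive keyword matching, assumed here purely for illustration.
MR_ONLY = re.compile(
    r"\b(GROUP\s+BY|JOIN|ORDER\s+BY|SORT\s+BY|CLUSTER\s+BY)\b",
    re.IGNORECASE,
)


def choose_plan(query: str, projected_rows: int, fetch_threshold: int) -> str:
    """Return "mapreduce" or "fetch" for a query.

    projected_rows is the engine's row estimate (hypothetical input);
    fetch_threshold plays the role of n in "FETCH n".
    """
    if MR_ONLY.search(query):
        # Aggregations, joins and sorts always go through MapReduce.
        return "mapreduce"
    if projected_rows > fetch_threshold:
        # Too many rows expected: fall back to a MapReduce job.
        return "mapreduce"
    # Small enough: fetch and filter on the client side.
    return "fetch"


if __name__ == "__main__":
    q = "SELECT a, b FROM t WHERE c = 'foo'"
    print(choose_plan(q, projected_rows=500, fetch_threshold=1000))
    print(choose_plan(q, projected_rows=5000, fetch_threshold=1000))
```

Treating n as an inclusive upper limit is a choice made for this sketch; the request itself does not say what should happen when the estimate equals n.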

For me, Hive has been excellent for storing raw parsed log data, which can be queried into summary tables of around 1 million rows. These summary tables, which contain aggregations, are then queried by a UI for visualization. This "fetch" functionality would let UI load times go from minutes to seconds and would reduce contention for task slots in a production Hadoop cluster.

> Allow SELECT <col> without a mapreduce job
> ------------------------------------------
>
>                 Key: HIVE-887
>                 URL: https://issues.apache.org/jira/browse/HIVE-887
>             Project: Hadoop Hive
>          Issue Type: New Feature
>         Environment: All
>            Reporter: Eric Sun
>            Assignee: Ning Zhang
>
> I often find myself needing to take a quick look at a particular column of a Hive table.
> I usually do this by doing a 
> SELECT * from <table> LIMIT 20;
> from the CLI.  Doing this is pretty fast since it doesn't require a mapreduce job.  However, it's tough to examine just 1 or 2 columns when the table is very wide.
> So, I might do
> SELECT <col> from <table> LIMIT 20;
> but it's much slower since it requires a map-reduce.  It'd be really convenient if a map-reduce wasn't necessary.
> Currently a good work around is to do
> hive -e "select * from table" | cut -f n
> but it'd be more convenient if it were built in since it alleviates the need for column counting.
