Posted to user@hive.apache.org by Eric Sammer <er...@lifeless.net> on 2010/01/18 19:10:19 UTC

Non-MR execution plan for simple WHERE

All:

I'm using Hive as a back end for a multilevel aggregation reporting
system. There is a core fact table that is currently multiple millions
of rows, but will quickly hit billions. A web service accepts user
queries, turns them into Hive queries, dynamically builds Hive tables
containing different aggregations that are commonly requested by front
end users if they don't exist, and eventually returns the results. This
web service is async: it initially gives the UI a ticket, which the UI
then uses to poll for results. These "summary" tables, which usually
have 1 million rows or fewer, are then queried instead of the raw fact
table.

In many cases, very simple WHERE clauses are applied to the summary
tables, which causes an MR job to be spawned. In cases where we know the
row count will be low, it would be very nice to avoid the MR overhead
and simply apply the filter inline and stream the results back
immediately. By simple, I mean a query containing only equality criteria
with no nested conditional groups, no GROUP BY clause, and no
complicated UDFs. The goal would be to reduce the time to begin
streaming results at the expense of data locality and of centralizing
the query execution.
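
To make the idea concrete, here is a minimal Python sketch of applying
an equality-only filter inline and streaming the matches as they are
found, with no job submission. It only illustrates the behavior
described above and is not Hive code; the function name stream_matches
and the sample column names are hypothetical.

```python
def stream_matches(rows, criteria):
    """Yield each row (a dict) whose columns equal every value in `criteria`.

    `criteria` maps column name -> required value, i.e. a flat list of
    equality tests joined by AND (no nested groups, no GROUP BY, no UDFs).
    """
    for row in rows:
        if all(col in row and row[col] == val for col, val in criteria.items()):
            # Rows are yielded one at a time, so the caller can start
            # consuming results before the whole input has been read.
            yield row


if __name__ == "__main__":
    # Hypothetical summary-table rows.
    summary = [
        {"region": "us", "day": "2010-01-17", "hits": 3},
        {"region": "eu", "day": "2010-01-17", "hits": 5},
        {"region": "us", "day": "2010-01-18", "hits": 7},
    ]
    for match in stream_matches(summary, {"region": "us"}):
        print(match)
```

Because the filter is a generator, the first matching row is available
as soon as it is read, which is the "seconds instead of minutes" effect
being asked for.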

The summary tables generally have a block count of only 3 or so, so the
degree of parallelism being sacrificed is minimal (compared to true back
end queries that have no user waiting on results). I know Hive's stated
goals indicate that low latency isn't in the cards, but I don't consider
this to be low latency in the RDBMS sense. I'm thinking of reducing the
start of result streaming from a number of minutes to seconds.

Obviously this is somewhat specialized, but I wanted to get people's
ideas on a keyword that can indicate this type of execution plan, the
feasibility within the code base, and the suitability of this being
within Hive's court. The alternative, building the results via Hive
and then having another (albeit simple) query layer comb through the
files, is cumbersome to maintain. The notion that one could do something
within Hive is appealing because then, based on a threshold of row
count, you could resort to doing an MR job if necessary even if you think
you're querying a small table. Or, maybe Hive can fail with some
indication that the centralized query exceeds the configured threshold.
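
The threshold behavior could look roughly like the Python sketch below.
This is an assumption about how such a switch might work, not an
existing Hive feature: Plan, ThresholdExceeded, choose_plan, and the
fail_over_threshold flag are all hypothetical names.

```python
from enum import Enum


class Plan(Enum):
    INLINE = "inline"        # filter applied centrally, no MR job
    MAPREDUCE = "mapreduce"  # regular distributed MR job


class ThresholdExceeded(Exception):
    """Raised when a centralized query would exceed the configured threshold."""


def choose_plan(estimated_rows, threshold, fail_over_threshold=False):
    """Pick an execution plan from an estimated row count.

    At or under the threshold the query runs inline. Over the threshold it
    either falls back to MR or, if `fail_over_threshold` is set, fails.
    """
    if estimated_rows <= threshold:
        return Plan.INLINE
    if fail_over_threshold:
        raise ThresholdExceeded(
            "estimated %d rows exceeds threshold %d" % (estimated_rows, threshold)
        )
    return Plan.MAPREDUCE


if __name__ == "__main__":
    print(choose_plan(500000, 1000000))
    print(choose_plan(2000000000, 1000000))
```

Both outcomes mentioned above are covered: the fallback to an MR job is
the default, and the failure mode is opt-in.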

I wanted to throw this out to the list before filing a feature request
in JIRA for obvious reasons.

Thoughts? Comments?
-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com

Re: Non-MR execution plan for simple WHERE

Posted by Eric Sammer <er...@lifeless.net>.
On 1/18/10 8:42 PM, Zheng Shao wrote:
> There is already a JIRA for it:
> https://issues.apache.org/jira/browse/HIVE-887
> Please add your thoughts to that JIRA.
> 

Remember, you asked for it! ;)

Thanks for pointing out the specific issue. I've added my $0.02USD along
with the specific use case for reference.

Thanks to both Zheng and Raghu for the pointers.

Regards.
-- 
Eric Sammer
eric@lifeless.net
http://esammer.blogspot.com

Re: Non-MR execution plan for simple WHERE

Posted by Zheng Shao <zs...@gmail.com>.
There is already a JIRA for it:
https://issues.apache.org/jira/browse/HIVE-887
Please add your thoughts to that JIRA.

Zheng
On Mon, Jan 18, 2010 at 10:19 AM, Raghu Murthy <rm...@facebook.com> wrote:

> Currently, only SELECT * with a partition predicate does not create an MR
> job. The client directly streams the files corresponding to the
> table/partition. I think there is a JIRA to also do simple projections and
> filters in the client.
>
> For now, if you know which queries are small, you could run MR locally on
> the client by providing the following options before running the query:
>
> set mapred.job.tracker=local;
> set mapred.local.dir=/tmp;


-- 
Yours,
Zheng

Re: Non-MR execution plan for simple WHERE

Posted by Raghu Murthy <rm...@facebook.com>.
Currently, only SELECT * with a partition predicate does not create an MR
job. The client directly streams the files corresponding to the
table/partition. I think there is a JIRA to also do simple projections and
filters in the client.

For now, if you know which queries are small, you could run MR locally on
the client by providing the following options before running the query:

set mapred.job.tracker=local;
set mapred.local.dir=/tmp;
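
For a web service that generates its queries, one way to apply these
options per query is to prepend them to the statement passed to the
Hive CLI. The Python sketch below only builds the command line and does
not run it. It assumes the hive executable is on the PATH and uses its
-e option; the helper name build_local_hive_command and the table and
column in the example are hypothetical.

```python
import shlex

# The two settings quoted above, which make Hadoop use its local job runner.
LOCAL_MR_SETTINGS = [
    "set mapred.job.tracker=local",
    "set mapred.local.dir=/tmp",
]


def build_local_hive_command(query):
    """Return an argv list that runs `query` after the local-mode settings."""
    # Drop surrounding whitespace and any trailing semicolon so that the
    # statements can be joined uniformly.
    statements = LOCAL_MR_SETTINGS + [query.strip().rstrip(";")]
    script = "; ".join(statements) + ";"
    return ["hive", "-e", script]


if __name__ == "__main__":
    argv = build_local_hive_command(
        "SELECT * FROM daily_summary WHERE region = 'us'"
    )
    # Print a shell-safe rendering of the command instead of executing it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The settings apply only to the session started by that command, so
large back-end queries submitted separately would still go to the
cluster.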
