You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Shirley Cohen <sc...@cs.utexas.edu> on 2008/12/20 17:48:44 UTC

question about mapred translation

Hi,

I'm just getting started with Hive. I was wondering if anyone can tell 
me how Hive would translate the following SQL query into MapReduce:

select distinct track, count(track)
from music
where date_listened between '12-10-2008' and '12-11-2008'
group by user
order by count(track) desc

How many mapred jobs would it use? How would the map and red functions 
looks like? Also, does Hive have a utility that gives you this information?

Thanks,

Shirley

Re: question about mapred translation

Posted by Zheng Shao <zs...@gmail.com>.

Hi Shirley,

We can do Group By (without a distinct clause) in both 1 and 2 map-reduce
jobs.
If you "set hive.map.aggr=true", we will only have 1 map-reduce jobs.

For "Order By", as Ashish said, we already have a similar clause "Sort by"
which provides orders in each of the reducers. Can you let us know what is
your use case? It's very possible that we can already achieve what you want
by "sort by" and "limit" clauses.

Zheng

On Mon, Dec 22, 2008 at 11:08 AM, Shirley Cohen <sc...@cs.utexas.edu>wrote:

> Hi Jeff and Ashish,
>
> Thanks for your response. Basically, what I was curious about is how Hive
> implements group by operations. Does it do so in one or two map reduce
> stages? Also, are order by's supported in the current version? If not, when
> will they be?
>
> I haven't had a chance to play with Hive yet, but I intend to do so pretty
> soon :))
>
> Shirley
>
> Ashish Thusoo wrote:
>
>> Hi Shirley,
>>  I think this query would give you an error currently, for two reasons:
>>  1. The select list does not contain the group by column and distinct
>> track group by user is indeterminate - what value of track do you want to
>> report here on the group of users?
>> 2. We do not have order by yet though you can sort of simulate it with a
>> sort by clause and 1 reducer.
>>  Can you explain more in terms of what you want the SQL to achieve?
>>  Also the hive wiki contains a number of presentations that describe how
>> SQL gets converted to Map/Reduce plans at a high level. Check them out at
>>  http://wiki.apache.org/hadoop/Hive/Presentations
>>  Ashish
>>
>> ------------------------------------------------------------------------
>> *From:* Jeff Hammerbacher [mailto:hammer@cloudera.com]
>> *Sent:* Saturday, December 20, 2008 11:35 AM
>> *To:* hive-user@hadoop.apache.org
>> *Subject:* Re: question about mapred translation
>>
>> Hey Shirley,
>>
>> Welcome to Hive! Once you've gotten Hive up and running and have created
>> the "music" table, you should be able to say "EXPLAIN <query>", where
>> <query> is the query specified below (or any other query). For more detailed
>> information, you can say "EXPLAIN EXTENDED <query>".
>>
>> The output from the EXPLAIN might be a little obtuse, so if you still have
>> questions, I'm sure more knowledgeable Hive folks can give you insight into
>> the planner.
>>
>> Regards,
>> Jeff
>>
>> On Sat, Dec 20, 2008 at 11:48 AM, Shirley Cohen <scohen@cs.utexas.edu<mailto:
>> scohen@cs.utexas.edu>> wrote:
>>
>>    Hi,
>>
>>    I'm just getting started with Hive. I was wondering if anyone can
>>    tell me how Hive would translate the following SQL query into
>>    MapReduce:
>>
>>    select distinct track, count(track)
>>    from music
>>    where date_listened between '12-10-2008' and '12-11-2008'
>>    group by user
>>    order by count(track) desc
>>
>>    How many mapred jobs would it use? How would the map and red
>>    functions looks like? Also, does Hive have a utility that gives
>>    you this information?
>>
>>    Thanks,
>>
>>    Shirley
>>
>>
>>
>>
>>
>
>


-- 
Yours,
Zheng

RE: question about mapred translation

Posted by Ashish Thusoo <at...@facebook.com>.

Right now Hive does the group bys in two stages - which is not optimal for all kinds of data but it happens to be the plan you would need for large quantities of data that has a power distribution on the group by keys. You could also speed up the group by a bit by using map side aggregations (set hive.map.aggr=true). The only time a single stage map reduce is used is when map side aggregation is on and there is no group by clause (e.g. select count(1) from T).

Zheng can give you more details on the order by though there is a workaround that you can use to do this today with the SORT BY construct. An example is there in the source tree in the following file:

./ql/src/test/queries/clientpositive/join0.q

note that this has to be run with number of reduce = 1 for global sorting.

I would suggest that you do the group by and put it in a table and then issue a second query with a sort by on that table.

Ashish

-----Original Message-----
From: Shirley Cohen [mailto:scohen@cs.utexas.edu] 
Sent: Monday, December 22, 2008 11:09 AM
To: hive-user@hadoop.apache.org
Subject: Re: question about mapred translation

Hi Jeff and Ashish,

Thanks for your response. Basically, what I was curious about is how Hive implements group by operations. Does it do so in one or two map reduce stages? Also, are order by's supported in the current version? If not, when will they be?

I haven't had a chance to play with Hive yet, but I intend to do so pretty soon :))

Shirley

Ashish Thusoo wrote:
> Hi Shirley,
>  
> I think this query would give you an error currently, for two reasons:
>  
> 1. The select list does not contain the group by column and distinct 
> track group by user is indeterminate - what value of track do you want 
> to report here on the group of users?
> 2. We do not have order by yet though you can sort of simulate it with 
> a sort by clause and 1 reducer.
>  
> Can you explain more in terms of what you want the SQL to achieve?
>  
> Also the hive wiki contains a number of presentations that describe 
> how SQL gets converted to Map/Reduce plans at a high level. Check them 
> out at
>  
> http://wiki.apache.org/hadoop/Hive/Presentations
>  
> Ashish
>
> ----------------------------------------------------------------------
> --
> *From:* Jeff Hammerbacher [mailto:hammer@cloudera.com]
> *Sent:* Saturday, December 20, 2008 11:35 AM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: question about mapred translation
>
> Hey Shirley,
>
> Welcome to Hive! Once you've gotten Hive up and running and have 
> created the "music" table, you should be able to say "EXPLAIN 
> <query>", where <query> is the query specified below (or any other 
> query). For more detailed information, you can say "EXPLAIN EXTENDED 
> <query>".
>
> The output from the EXPLAIN might be a little obtuse, so if you still 
> have questions, I'm sure more knowledgeable Hive folks can give you 
> insight into the planner.
>
> Regards,
> Jeff
>
> On Sat, Dec 20, 2008 at 11:48 AM, Shirley Cohen <scohen@cs.utexas.edu 
> <ma...@cs.utexas.edu>> wrote:
>
>     Hi,
>
>     I'm just getting started with Hive. I was wondering if anyone can
>     tell me how Hive would translate the following SQL query into
>     MapReduce:
>
>     select distinct track, count(track)
>     from music
>     where date_listened between '12-10-2008' and '12-11-2008'
>     group by user
>     order by count(track) desc
>
>     How many mapred jobs would it use? How would the map and red
>     functions looks like? Also, does Hive have a utility that gives
>     you this information?
>
>     Thanks,
>
>     Shirley
>
>
>
>

Re: question about mapred translation

Posted by Shirley Cohen <sc...@cs.utexas.edu>.

Hi Jeff and Ashish,

Thanks for your response. Basically, what I was curious about is how 
Hive implements group by operations. Does it do so in one or two map 
reduce stages? Also, are order by's supported in the current version? If 
not, when will they be?

I haven't had a chance to play with Hive yet, but I intend to do so 
pretty soon :))

Shirley

Ashish Thusoo wrote:
> Hi Shirley,
>  
> I think this query would give you an error currently, for two reasons:
>  
> 1. The select list does not contain the group by column and distinct 
> track group by user is indeterminate - what value of track do you want 
> to report here on the group of users?
> 2. We do not have order by yet though you can sort of simulate it with 
> a sort by clause and 1 reducer.
>  
> Can you explain more in terms of what you want the SQL to achieve?
>  
> Also the hive wiki contains a number of presentations that describe 
> how SQL gets converted to Map/Reduce plans at a high level. Check them 
> out at
>  
> http://wiki.apache.org/hadoop/Hive/Presentations
>  
> Ashish
>
> ------------------------------------------------------------------------
> *From:* Jeff Hammerbacher [mailto:hammer@cloudera.com]
> *Sent:* Saturday, December 20, 2008 11:35 AM
> *To:* hive-user@hadoop.apache.org
> *Subject:* Re: question about mapred translation
>
> Hey Shirley,
>
> Welcome to Hive! Once you've gotten Hive up and running and have 
> created the "music" table, you should be able to say "EXPLAIN 
> <query>", where <query> is the query specified below (or any other 
> query). For more detailed information, you can say "EXPLAIN EXTENDED 
> <query>".
>
> The output from the EXPLAIN might be a little obtuse, so if you still 
> have questions, I'm sure more knowledgeable Hive folks can give you 
> insight into the planner.
>
> Regards,
> Jeff
>
> On Sat, Dec 20, 2008 at 11:48 AM, Shirley Cohen <scohen@cs.utexas.edu 
> <ma...@cs.utexas.edu>> wrote:
>
>     Hi,
>
>     I'm just getting started with Hive. I was wondering if anyone can
>     tell me how Hive would translate the following SQL query into
>     MapReduce:
>
>     select distinct track, count(track)
>     from music
>     where date_listened between '12-10-2008' and '12-11-2008'
>     group by user
>     order by count(track) desc
>
>     How many mapred jobs would it use? How would the map and red
>     functions looks like? Also, does Hive have a utility that gives
>     you this information?
>
>     Thanks,
>
>     Shirley
>
>
>
>

RE: question about mapred translation

Posted by Ashish Thusoo <at...@facebook.com>.

Hi Shirley,

I think this query would give you an error currently, for two reasons:

1. The select list does not contain the group by column and distinct track group by user is indeterminate - what value of track do you want to report here on the group of users?
2. We do not have order by yet though you can sort of simulate it with a sort by clause and 1 reducer.

Can you explain more in terms of what you want the SQL to achieve?

Also the hive wiki contains a number of presentations that describe how SQL gets converted to Map/Reduce plans at a high level. Check them out at

http://wiki.apache.org/hadoop/Hive/Presentations

Ashish

________________________________
From: Jeff Hammerbacher [mailto:hammer@cloudera.com]
Sent: Saturday, December 20, 2008 11:35 AM
To: hive-user@hadoop.apache.org
Subject: Re: question about mapred translation

Hey Shirley,

Welcome to Hive! Once you've gotten Hive up and running and have created the "music" table, you should be able to say "EXPLAIN <query>", where <query> is the query specified below (or any other query). For more detailed information, you can say "EXPLAIN EXTENDED <query>".

The output from the EXPLAIN might be a little obtuse, so if you still have questions, I'm sure more knowledgeable Hive folks can give you insight into the planner.

Regards,
Jeff

On Sat, Dec 20, 2008 at 11:48 AM, Shirley Cohen <sc...@cs.utexas.edu>> wrote:
Hi,

I'm just getting started with Hive. I was wondering if anyone can tell me how Hive would translate the following SQL query into MapReduce:

select distinct track, count(track)
from music
where date_listened between '12-10-2008' and '12-11-2008'
group by user
order by count(track) desc

How many mapred jobs would it use? How would the map and red functions looks like? Also, does Hive have a utility that gives you this information?

Thanks,

Shirley

Re: question about mapred translation

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey Shirley,

Welcome to Hive! Once you've gotten Hive up and running and have created the
"music" table, you should be able to say "EXPLAIN <query>", where <query> is
the query specified below (or any other query). For more detailed
information, you can say "EXPLAIN EXTENDED <query>".

The output from the EXPLAIN might be a little obtuse, so if you still have
questions, I'm sure more knowledgeable Hive folks can give you insight into
the planner.

Regards,
Jeff

On Sat, Dec 20, 2008 at 11:48 AM, Shirley Cohen <sc...@cs.utexas.edu>wrote:

> Hi,
>
> I'm just getting started with Hive. I was wondering if anyone can tell me
> how Hive would translate the following SQL query into MapReduce:
>
> select distinct track, count(track)
> from music
> where date_listened between '12-10-2008' and '12-11-2008'
> group by user
> order by count(track) desc
>
> How many mapred jobs would it use? How would the map and red functions
> looks like? Also, does Hive have a utility that gives you this information?
>
> Thanks,
>
> Shirley
>
>
>
>