You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Tom Nichols <tm...@gmail.com> on 2010/02/23 21:46:31 UTC

selecting data where max is found

Hi Hive users,

I've got a somewhat convoluted query which I'm wondering how I would
translate it to hive...  It is similar to the first FAQ example here:
http://www.techonthenet.com/sql/max.php
So following the example above, I could easily get the last run date
for each report, but how would I get the _user_ who last ran the
report?  It involves a join based on a subquery.  Conceptually this
seems like it wouldn't be difficult because I can carry all of the
columns while searching for a 'max' of one column in a M-R job...  it
just doesn't express itself in SQL quite the same way.

Is there a way to put the result of my inner select directly into
another Hive table and then my outer select would just join on that
table?

Thanks.

Re: selecting data where max is found

Posted by Tom Nichols <tm...@gmail.com>.
I'm using Hive 0.4.1 and Hadoop 0.20.3.  Thanks guys I'll try out
those suggestions.

On Tue, Feb 23, 2010 at 4:27 PM, Zheng Shao <zs...@gmail.com> wrote:
> 1. Hive does not support "FROM a, b" kind of join. You can use "FROM a
> JOIN b ON ..".
>
> 2. You might want to use the following simpler (and faster) query:
>
> SELECT report_name, m.col1 as report_run_date, m.col2 as user_name
> FROM (
>  SELECT  report_name, max(struct(report_run_date, user_name)) as m
>  FROM report_history
> ) tmp;
>
> The struct udf is added recently. It's only in trunk.
> Zheng
>
> On Tue, Feb 23, 2010 at 1:14 PM, Tom Nichols <tm...@gmail.com> wrote:
>> Really?  It looked to me like Hive was complaining about the comma at
>> the end of "from report_history rh,"
>>
>> SELECT rh.user_name, rh.report_name, rh.report_run_date
>> FROM report_history rh,
>>   (SELECT max(report_run_date) as maxdate, report_name
>>     FROM report_history
>>     GROUP BY report_name) maxresults
>> WHERE rh.report_name = maxresults.report_name
>> AND rh.report_run_date= maxresults.maxdate;
>>
>> FAILED: Parse Error: line 2:20 mismatched input ',' expecting EOF
>>
>> I was able to perform the same task by using an intermediate table for
>> the inner query but that obviously complicates things and I'd suspect
>> makes it take longer.
>>
>> Thanks again for the help!
>> -Tom
>>
>>
>> On Tue, Feb 23, 2010 at 4:00 PM, Carl Steinbach <ca...@cloudera.com> wrote:
>>> Tom,
>>>
>>> The example in the link will compile and execute on Hive without
>>> modification.
>>>
>>> Carl
>>>
>>> On Tue, Feb 23, 2010 at 12:46 PM, Tom Nichols <tm...@gmail.com> wrote:
>>>>
>>>> Hi Hive users,
>>>>
>>>> I've got a somewhat convoluted query which I'm wondering how I would
>>>> translate it to hive...  It is similar to the first FAQ example here:
>>>> http://www.techonthenet.com/sql/max.php
>>>> So following the example above, I could easily get the last run date
>>>> for each report, but how would I get the _user_ who last ran the
>>>> report?  It involves a join based on a subquery.  Conceptually this
>>>> seems like it wouldn't be difficult because I can carry all of the
>>>> columns while searching for a 'max' of one column in a M-R job...  it
>>>> just doesn't express itself in SQL quite the same way.
>>>>
>>>> Is there a way to put the result of my inner select directly into
>>>> another Hive table and then my outer select would just join on that
>>>> table?
>>>>
>>>> Thanks.
>>>
>>>
>>
>
>
>
> --
> Yours,
> Zheng
>

Re: selecting data where max is found

Posted by Zheng Shao <zs...@gmail.com>.
1. Hive does not support "FROM a, b" kind of join. You can use "FROM a
JOIN b ON ..".

2. You might want to use the following simpler (and faster) query:

SELECT report_name, m.col1 as report_run_date, m.col2 as user_name
FROM (
  SELECT  report_name, max(struct(report_run_date, user_name)) as m
  FROM report_history
) tmp;

The struct udf is added recently. It's only in trunk.
Zheng

On Tue, Feb 23, 2010 at 1:14 PM, Tom Nichols <tm...@gmail.com> wrote:
> Really?  It looked to me like Hive was complaining about the comma at
> the end of "from report_history rh,"
>
> SELECT rh.user_name, rh.report_name, rh.report_run_date
> FROM report_history rh,
>   (SELECT max(report_run_date) as maxdate, report_name
>     FROM report_history
>     GROUP BY report_name) maxresults
> WHERE rh.report_name = maxresults.report_name
> AND rh.report_run_date= maxresults.maxdate;
>
> FAILED: Parse Error: line 2:20 mismatched input ',' expecting EOF
>
> I was able to perform the same task by using an intermediate table for
> the inner query but that obviously complicates things and I'd suspect
> makes it take longer.
>
> Thanks again for the help!
> -Tom
>
>
> On Tue, Feb 23, 2010 at 4:00 PM, Carl Steinbach <ca...@cloudera.com> wrote:
>> Tom,
>>
>> The example in the link will compile and execute on Hive without
>> modification.
>>
>> Carl
>>
>> On Tue, Feb 23, 2010 at 12:46 PM, Tom Nichols <tm...@gmail.com> wrote:
>>>
>>> Hi Hive users,
>>>
>>> I've got a somewhat convoluted query which I'm wondering how I would
>>> translate it to hive...  It is similar to the first FAQ example here:
>>> http://www.techonthenet.com/sql/max.php
>>> So following the example above, I could easily get the last run date
>>> for each report, but how would I get the _user_ who last ran the
>>> report?  It involves a join based on a subquery.  Conceptually this
>>> seems like it wouldn't be difficult because I can carry all of the
>>> columns while searching for a 'max' of one column in a M-R job...  it
>>> just doesn't express itself in SQL quite the same way.
>>>
>>> Is there a way to put the result of my inner select directly into
>>> another Hive table and then my outer select would just join on that
>>> table?
>>>
>>> Thanks.
>>
>>
>



-- 
Yours,
Zheng

Re: selecting data where max is found

Posted by Tom Nichols <tm...@gmail.com>.
Carl!  Your rewrite of the query did the trick.  I see how change it
to use a join instead.  I'm definitely not a SQL wiz :)  Thanks!

Looking forward to trying out 0.5.0.

On Tue, Feb 23, 2010 at 4:27 PM, Carl Steinbach <ca...@cloudera.com> wrote:
> Also, what happens if you rewrite the query using explicit JOIN syntax?
>
> SELECT rh.user_name, rh.report_name, rh.report_run_date
> FROM
>   (SELECT max(report_run_date) as maxdate, report_name
>    FROM report_history
>    GROUP BY report_name) maxresults
>  JOIN report_history rh
>  ON (rh.report_name = maxresults.report_name AND rh.report_run_date =
> maxresults.maxdata)

Re: selecting data where max is found

Posted by Carl Steinbach <ca...@cloudera.com>.
Hi Tom,

Which version of Hive are you using? This may not work on Hive 0.4 (see
http://issues.apache.org/jira/browse/HIVE-915)

Also, what happens if you rewrite the query using explicit JOIN syntax?

SELECT rh.user_name, rh.report_name, rh.report_run_date
FROM
  (SELECT max(report_run_date) as maxdate, report_name
   FROM report_history
   GROUP BY report_name) maxresults
 JOIN report_history rh
 ON (rh.report_name = maxresults.report_name AND rh.report_run_date =
maxresults.maxdata)

Carl

On Tue, Feb 23, 2010 at 1:14 PM, Tom Nichols <tm...@gmail.com> wrote:

> Really?  It looked to me like Hive was complaining about the comma at
> the end of "from report_history rh,"
>
> SELECT rh.user_name, rh.report_name, rh.report_run_date
> FROM report_history rh,
>   (SELECT max(report_run_date) as maxdate, report_name
>     FROM report_history
>     GROUP BY report_name) maxresults
> WHERE rh.report_name = maxresults.report_name
> AND rh.report_run_date= maxresults.maxdate;
>
> FAILED: Parse Error: line 2:20 mismatched input ',' expecting EOF
>
> I was able to perform the same task by using an intermediate table for
> the inner query but that obviously complicates things and I'd suspect
> makes it take longer.
>
> Thanks again for the help!
> -Tom
>
>
> On Tue, Feb 23, 2010 at 4:00 PM, Carl Steinbach <ca...@cloudera.com> wrote:
> > Tom,
> >
> > The example in the link will compile and execute on Hive without
> > modification.
> >
> > Carl
> >
> > On Tue, Feb 23, 2010 at 12:46 PM, Tom Nichols <tm...@gmail.com>
> wrote:
> >>
> >> Hi Hive users,
> >>
> >> I've got a somewhat convoluted query which I'm wondering how I would
> >> translate it to hive...  It is similar to the first FAQ example here:
> >> http://www.techonthenet.com/sql/max.php
> >> So following the example above, I could easily get the last run date
> >> for each report, but how would I get the _user_ who last ran the
> >> report?  It involves a join based on a subquery.  Conceptually this
> >> seems like it wouldn't be difficult because I can carry all of the
> >> columns while searching for a 'max' of one column in a M-R job...  it
> >> just doesn't express itself in SQL quite the same way.
> >>
> >> Is there a way to put the result of my inner select directly into
> >> another Hive table and then my outer select would just join on that
> >> table?
> >>
> >> Thanks.
> >
> >
>

Re: selecting data where max is found

Posted by Tom Nichols <tm...@gmail.com>.
Really?  It looked to me like Hive was complaining about the comma at
the end of "from report_history rh,"

SELECT rh.user_name, rh.report_name, rh.report_run_date
FROM report_history rh,
   (SELECT max(report_run_date) as maxdate, report_name
     FROM report_history
     GROUP BY report_name) maxresults
WHERE rh.report_name = maxresults.report_name
AND rh.report_run_date= maxresults.maxdate;

FAILED: Parse Error: line 2:20 mismatched input ',' expecting EOF

I was able to perform the same task by using an intermediate table for
the inner query but that obviously complicates things and I'd suspect
makes it take longer.

Thanks again for the help!
-Tom


On Tue, Feb 23, 2010 at 4:00 PM, Carl Steinbach <ca...@cloudera.com> wrote:
> Tom,
>
> The example in the link will compile and execute on Hive without
> modification.
>
> Carl
>
> On Tue, Feb 23, 2010 at 12:46 PM, Tom Nichols <tm...@gmail.com> wrote:
>>
>> Hi Hive users,
>>
>> I've got a somewhat convoluted query which I'm wondering how I would
>> translate it to hive...  It is similar to the first FAQ example here:
>> http://www.techonthenet.com/sql/max.php
>> So following the example above, I could easily get the last run date
>> for each report, but how would I get the _user_ who last ran the
>> report?  It involves a join based on a subquery.  Conceptually this
>> seems like it wouldn't be difficult because I can carry all of the
>> columns while searching for a 'max' of one column in a M-R job...  it
>> just doesn't express itself in SQL quite the same way.
>>
>> Is there a way to put the result of my inner select directly into
>> another Hive table and then my outer select would just join on that
>> table?
>>
>> Thanks.
>
>

Re: selecting data where max is found

Posted by Carl Steinbach <ca...@cloudera.com>.
Tom,

The example in the link will compile and execute on Hive without
modification.

Carl

On Tue, Feb 23, 2010 at 12:46 PM, Tom Nichols <tm...@gmail.com> wrote:

> Hi Hive users,
>
> I've got a somewhat convoluted query which I'm wondering how I would
> translate it to hive...  It is similar to the first FAQ example here:
> http://www.techonthenet.com/sql/max.php
> So following the example above, I could easily get the last run date
> for each report, but how would I get the _user_ who last ran the
> report?  It involves a join based on a subquery.  Conceptually this
> seems like it wouldn't be difficult because I can carry all of the
> columns while searching for a 'max' of one column in a M-R job...  it
> just doesn't express itself in SQL quite the same way.
>
> Is there a way to put the result of my inner select directly into
> another Hive table and then my outer select would just join on that
> table?
>
> Thanks.
>