Posted to user@hive.apache.org by Kortni Smith <ks...@abebooks.com> on 2010/05/26 22:36:30 UTC

RE: rownum, row_number() or looping ability with hiveql? - LIMIT clause

Thanks Josh. If I could use the LIMIT clause I would, but I need to impose a
limit on each group (for example, at most 10 results per isbn), and Hive does
not support subqueries in the WHERE clause, so I'm at a loss for how to
implement this.  Any examples or further thoughts on this?

 

We're currently evaluating whether to use pig or hive - and this is one thing
that was easier to implement with pig.  Am I overlooking a simple approach?

Thanks again for any assistance on this.

 

Kortni

 

From: John Sichi [mailto:jsichi@facebook.com] 
Sent: Tuesday, May 25, 2010 12:21 PM
To: hive-user@hadoop.apache.org
Subject: Re: rownum, row_number() or looping ability with hiveql?

 

In your simple example, you can probably use the LIMIT clause, but for more
advanced cases, here's a patch for a ROW_SEQUENCE UDF (not committed to trunk
yet):

 

https://issues.apache.org/jira/browse/HIVE-1304

 

The caveat is that since we don't actually have a SQL/OLAP implementation
yet, you have to use ORDER BY at the nested query level (rather than the OVER
clause level where it belongs) and cross your fingers.
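To illustrate the caveat, here is a minimal Python sketch rather than Hive code. The function row_sequence is a hypothetical stand-in for a sequence-generating UDF, not the HIVE-1304 patch itself. It numbers rows in whatever order they arrive, so the numbers only follow price order if the rows were already sorted upstream, which is the role the nested ORDER BY plays.

```python
from itertools import count


def row_sequence(rows):
    """Attach a running counter (starting at 1) to rows in arrival order.

    Hypothetical stand-in for a sequence UDF: it knows nothing about
    sorting, it only counts the rows it is handed.
    """
    counter = count(1)
    return [(next(counter), row) for row in rows]


prices = [12.50, 3.99, 8.00]

# Without an upstream sort, the numbering follows arrival order.
unsorted_numbered = row_sequence(prices)

# With an upstream sort (the nested ORDER BY), it follows price order.
sorted_numbered = row_sequence(sorted(prices))
```

The counter has no ordering of its own, which is why the result depends on the engine actually delivering rows to it in sorted order.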

 

JVS

 

On May 25, 2010, at 12:13 PM, Kortni Smith wrote:





Hi,

 

Is there a hive equivalent to Oracle's rownum, row_number() or the ability to
loop through a resultset?

 

I have been struggling to create a hive query that will give me max X
records, per something, when sorted by something.  For example, I have book
data, multiple records for any given isbn, and want the lowest 5 priced books
per isbn.

 

I can accomplish this in oracle with the following:

 

select isbn, price from
(
  select isbn, price,
         row_number() over (partition by isbn order by price asc) rn
  from kstest
)
where rn <= 5;

 

Any  ideas would be greatly appreciated.

Thank you,

 

Kortni Smith | Software Developer
AbeBooks.com   <http://www.abebooks.com/> Passion for books.

ksmith@abebooks.com <ma...@abebooks.com> 
phone: 250.412.3272  |  fax: 250.475.6014

Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com

Re: rownum, row_number() or looping ability with hiveql? - LIMIT clause

Posted by John Sichi <js...@facebook.com>.

If you write a similar UDF which takes the partitioning column (isbn in your example) as input and restarts the sequence at 0 whenever it sees a new value in this column, then you can combine it with

ORDER BY isbn,price

in a subselect to get the result you want.  Whether or not this is good enough depends on the amount of data on which the ranking filter is being applied (since ORDER BY currently forces everything to go through a single reducer).
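The approach described above can be sketched in plain Python to show the logic. This is an illustration under stated assumptions, not a Hive UDF: the class KeyedRowSequence and the function lowest_priced are hypothetical names, and the sequence is taken to be zero-based (the first row of each isbn gets 0), which is one reading of "restarts the sequence at 0".

```python
_UNSET = object()  # sentinel so that the first key always counts as "new"


class KeyedRowSequence:
    """Hypothetical stateful counter that restarts at 0 on each new key."""

    def __init__(self):
        self._last_key = _UNSET
        self._next = 0

    def evaluate(self, key):
        # A change in the partitioning value restarts the sequence.
        if key != self._last_key:
            self._last_key = key
            self._next = 0
        value = self._next
        self._next += 1
        return value


def lowest_priced(rows, limit):
    """Return at most `limit` lowest-priced (isbn, price) rows per isbn."""
    seq = KeyedRowSequence()
    # Stands in for ORDER BY isbn, price in the subselect.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    # Stands in for the outer filter on the sequence value.
    return [(isbn, price) for isbn, price in ordered
            if seq.evaluate(isbn) < limit]


rows = [("A", 9.0), ("B", 4.0), ("A", 5.0), ("A", 7.0), ("B", 2.0)]
top2 = lowest_priced(rows, 2)
```

The counter is only correct when all rows for one isbn arrive together and in price order. The single sorted() call here plays the same role as the single reducer mentioned above, so the same scaling concern applies.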

JVS
