You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by Kortni Smith <ks...@abebooks.com> on 2010/05/25 21:13:41 UTC

rownum, row_number() or looping ability with hiveql?

Hi,

 

Is there a hive equivalent to Oracle's rownum, row_number() or the ability to
loop through a resultset?

 

I have been struggling to create a hive query that will give me max X
records, per something, when sorted by something.  For example, I have book
data, multiple records for any given isbn, and want the lowest 5 priced books
per isbn.

 

I can accomplish this in oracle with the following:

 

select isbn, price from

(

select isbn, price, row_number() over (partition by isbn order by price asc)
rn 

from kstest

)

where rn <= 5;

 

Any  ideas would be greatly appreciated.

Thank you,

 

Kortni Smith | Software Developer
AbeBooks.com   <http://www.abebooks.com/> Passion for books.

ksmith@abebooks.com <ma...@abebooks.com> 
phone: 250.412.3272  |  fax: 250.475.6014

Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com

Re: rownum, row_number() or looping ability with hiveql?

Posted by Sonal Goyal <so...@gmail.com>.

Hi Kortni,

You can define an aggregate function for this and write custom code.

Thanks and Regards,
Sonal
www.meghsoft.com
http://in.linkedin.com/in/sonalgoyal



On Wed, May 26, 2010 at 12:43 AM, Kortni Smith <ks...@abebooks.com> wrote:
> Hi,
>
>
>
> Is there a hive equivalent to Oracle’s rownum, row_number() or the ability
> to loop through a resultset?
>
>
>
> I have been struggling to create a hive query that will give me max X
> records, per something, when sorted by something.  For example, I have book
> data, multiple records for any given isbn, and want the lowest 5 priced
> books per isbn.
>
>
>
> I can accomplish this in oracle with the following:
>
>
>
> select isbn, price from
>
> (
>
> select isbn, price, row_number() over (partition by isbn order by price asc)
> rn
>
> from kstest
>
> )
>
> where rn <= 5;
>
>
>
> Any  ideas would be greatly appreciated.
>
> Thank you,
>
>
>
> Kortni Smith | Software Developer
> AbeBooks.com  Passion for books.
>
> ksmith@abebooks.com
> phone: 250.412.3272  |  fax: 250.475.6014
>
> Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5
>
> www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
> www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com
>
>

Re: rownum, row_number() or looping ability with hiveql? - LIMIT clause

Posted by John Sichi <js...@facebook.com>.

If you write a similar UDF which takes the partitioning column (isbn in your example) as input and restarts the sequence at 0 whenever it sees a new value in this column, then you can combine it with

ORDER BY isbn,price

in a subselect to get the result you want.  Whether or not this is good enough depends on the amount of data on which the ranking filter is being applied (since ORDER BY currently forces everything to go through a single reducer).

JVS

On May 26, 2010, at 1:36 PM, Kortni Smith wrote:

Thank Josh, If I could use the LIMIT clause I’d like to, but since I need to impose a limit on each group (max 10 results PER isbn’s for example), and hive does not support subqueries in the where clause, I’m at a loss for how to implement this.  Any examples or further thoughts on this?

We’re currently evaluating whether to use pig or hive – and this is one thing that was easier to implement with pig.  Am I overlooking a simple approach?
Thanks again for any assistance on this.

Kortni

From: John Sichi [mailto:jsichi@facebook.com]
Sent: Tuesday, May 25, 2010 12:21 PM
To: hive-user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: rownum, row_number() or looping ability with hiveql?

In your simple example, you can probably use the LIMIT clause, but for more advanced cases, here's a patch for a ROW_SEQUENCE UDF (not committed to trunk yet):

https://issues.apache.org/jira/browse/HIVE-1304

The caveat is that since we don't actually have a SQL/OLAP implementation yet, you have to use ORDER BY at the nested query level (rather than the OVER clause level where it belongs) and cross your fingers.

JVS

On May 25, 2010, at 12:13 PM, Kortni Smith wrote:


Hi,

Is there a hive equivalent to Oracle’s rownum, row_number() or the ability to loop through a resultset?

I have been struggling to create a hive query that will give me max X records, per something, when sorted by something.  For example, I have book data, multiple records for any given isbn, and want the lowest 5 priced books per isbn.

I can accomplish this in oracle with the following:

select isbn, price from
(
select isbn, price, row_number() over (partition by isbn order by price asc) rn
from kstest
)
where rn <= 5;

Any  ideas would be greatly appreciated.
Thank you,

Kortni Smith | Software Developer
AbeBooks.com  <http://www.abebooks.com/> Passion for books.

ksmith@abebooks.com<ma...@abebooks.com>
phone: 250.412.3272  |  fax: 250.475.6014

Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

www.abebooks.com  <x-msg://24/www.abebooks.com  > |  www.abebooks.co.uk<x-msg://24/www.abebooks.co.uk>  |  www.abebooks.de<x-msg://24/www.abebooks.de>
www.abebooks.fr  <x-msg://24/www.abebooks.fr  > |  www.abebooks.it<x-msg://24/www.abebooks.it>  |  www.iberlibro.com<x-msg://24/www.iberlibro.com>

RE: rownum, row_number() or looping ability with hiveql? - LIMIT clause

Posted by Kortni Smith <ks...@abebooks.com>.

Thank Josh, If I could use the LIMIT clause I'd like to, but since I need to
impose a limit on each group (max 10 results PER isbn's for example), and
hive does not support subqueries in the where clause, I'm at a loss for how
to implement this.  Any examples or further thoughts on this?

 

We're currently evaluating whether to use pig or hive - and this is one thing
that was easier to implement with pig.  Am I overlooking a simple approach?

Thanks again for any assistance on this.

 

Kortni

 

From: John Sichi [mailto:jsichi@facebook.com] 
Sent: Tuesday, May 25, 2010 12:21 PM
To: hive-user@hadoop.apache.org
Subject: Re: rownum, row_number() or looping ability with hiveql?

 

In your simple example, you can probably use the LIMIT clause, but for more
advanced cases, here's a patch for a ROW_SEQUENCE UDF (not committed to trunk
yet):

 

https://issues.apache.org/jira/browse/HIVE-1304

 

The caveat is that since we don't actually have a SQL/OLAP implementation
yet, you have to use ORDER BY at the nested query level (rather than the OVER
clause level where it belongs) and cross your fingers.

 

JVS

 

On May 25, 2010, at 12:13 PM, Kortni Smith wrote:





Hi,

 

Is there a hive equivalent to Oracle's rownum, row_number() or the ability to
loop through a resultset?

 

I have been struggling to create a hive query that will give me max X
records, per something, when sorted by something.  For example, I have book
data, multiple records for any given isbn, and want the lowest 5 priced books
per isbn.

 

I can accomplish this in oracle with the following:

 

select isbn, price from

(

select isbn, price, row_number() over (partition by isbn order by price asc)
rn

from kstest

)

where rn <= 5;

 

Any  ideas would be greatly appreciated.

Thank you,

 

Kortni Smith | Software Developer
AbeBooks.com   <http://www.abebooks.com/> Passion for books.

ksmith@abebooks.com <ma...@abebooks.com> 
phone: 250.412.3272  |  fax: 250.475.6014

Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

www.abebooks.com   <x-msg://24/www.abebooks.com  > |  www.abebooks.co.uk
<x-msg://24/www.abebooks.co.uk>   |  www.abebooks.de
<x-msg://24/www.abebooks.de> 
www.abebooks.fr   <x-msg://24/www.abebooks.fr  > |  www.abebooks.it
<x-msg://24/www.abebooks.it>   |  www.iberlibro.com
<x-msg://24/www.iberlibro.com>

Re: rownum, row_number() or looping ability with hiveql?

Posted by John Sichi <js...@facebook.com>.

In your simple example, you can probably use the LIMIT clause, but for more advanced cases, here's a patch for a ROW_SEQUENCE UDF (not committed to trunk yet):

https://issues.apache.org/jira/browse/HIVE-1304

The caveat is that since we don't actually have a SQL/OLAP implementation yet, you have to use ORDER BY at the nested query level (rather than the OVER clause level where it belongs) and cross your fingers.

JVS

On May 25, 2010, at 12:13 PM, Kortni Smith wrote:

Hi,

Is there a hive equivalent to Oracle’s rownum, row_number() or the ability to loop through a resultset?

I have been struggling to create a hive query that will give me max X records, per something, when sorted by something.  For example, I have book data, multiple records for any given isbn, and want the lowest 5 priced books per isbn.

I can accomplish this in oracle with the following:

select isbn, price from
(
select isbn, price, row_number() over (partition by isbn order by price asc) rn
from kstest
)
where rn <= 5;

Any  ideas would be greatly appreciated.
Thank you,

Kortni Smith | Software Developer
AbeBooks.com  <http://www.abebooks.com/> Passion for books.

ksmith@abebooks.com<ma...@abebooks.com>
phone: 250.412.3272  |  fax: 250.475.6014

Suite 500 - 655 Tyee Rd. Victoria, BC. Canada V9A 6X5

www.abebooks.com  |  www.abebooks.co.uk  |  www.abebooks.de
www.abebooks.fr  |  www.abebooks.it  |  www.iberlibro.com