You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mike Roberts <mi...@spyfu.com> on 2010/01/26 07:47:01 UTC

How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?

I'm trying to use Hive to solve a fairly common SQL scenario that I run into.  I have boiled the problem down into its most basic form:

You have a table of transactions defined as so:  CREATE TABLE transactions (product_id INT, customer_id INT)

|--------------------|
|--Transactions------|
|---product_id (INT)-|
|---customer_id(INT)-|
|--------------------|


****The goal is simple: For each product, produce a list of the top 5 largest customers.****

So, the base query would look like this:

SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id

You could insert that value into another table called products_bought defined as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)

Now you have an intermediate result that tells you how many times each customer bought each product.  But, obviously, that doesn't completely solve the problem.

At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY.  Here's an example in T-SQL:

--THE CURSOR METHOD:

DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t

OPEN product_cur

FETCH product_cur into @productId

WHILE (@@FETCH_STATUS <> -1)
BEGIN

   FETCH product_cur into @productId
   INSERT top_customers_by_product
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc


END
CLOSE Domains
DEALLOCATE Domains


--THE CROSS APPLY METHOD:

--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc
GO

--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions

--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A


Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares).  How can I do the same thing in Hive?  Here's the question restated: How can I implement a cursor in Hive?  How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in Hive?


I realize that I can implement a cursor outside of Hive and just execute the same Hive script over and over and over again.  And, that's not a horrible solution as long as it leverages the full power of Hadoop.  My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive.

Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem.

Actually, I'm also not particularly religious about using Hive.  If there's some other tech that does what I need, that's cool too.

Thanks in advance.

Mike Roberts

Hive UDAF Project Posted on Rentacoder

Posted by Mike Roberts <mi...@spyfu.com>.
I need that max_n UDAF developed that Zheng suggests below.  I have it open for bidding on Rentacoder here:  http://www.rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequest.asp?lngBidRequestId=1335150

I'll post the working code when I get it.

--Mike

-----Original Message-----
From: Zheng Shao [mailto:zshao9@gmail.com] 
Sent: Tuesday, January 26, 2010 12:03 AM
To: hive-user@hadoop.apache.org; Paul Yang
Subject: Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?

We can use a combination of UDAF and LATERAL VIEW to implement what you want.

1. Define a UDAF like this: max_n(5, products_bought, customer_id)
which returns the top 5 products_bought and their customer_id in type
of array<struct<col0:int,col1:int>>
2. Use the Lateral views (with explode) to transform a single row into
multiple rows.

SELECT t.product_id, t5.products_bought, t5.customer_id
FROM (
  SELECT product_id, max_n(5, products_bought, customer_id) as top5
  FROM temp
  GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS
products_bought, customer_id;

See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
Paul is the author of UDTF and Lateral view. He might be able to give
you more details.

Zheng

On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts <mi...@spyfu.com> wrote:
>
>
> I'm trying to use Hive to solve a fairly common SQL scenario that I run
> into.  I have boiled the problem down into its most basic form:
>
>
>
> You have a table of transactions defined as so:  CREATE TABLE transactions
> (product_id INT, customer_id INT)
>
>
>
> |--------------------|
>
> |--Transactions------|
>
> |---product_id (INT)-|
>
> |---customer_id(INT)-|
>
> |--------------------|
>
>
>
>
>
> ****The goal is simple: For each product, produce a list of the top 5
> largest customers.****
>
>
>
> So, the base query would look like this:
>
>
>
> SELECT product_id, customer_id, count(*) as products_bought
>
> FROM transactions
>
> GROUP BY product_id, customer_id
>
>
>
> You could insert that value into another table called products_bought
> defined as:
>
> CREATE TABLE prod_bought
>
> (product_id INT, customer_id INT, products_bought INT)
>
>
>
> Now you have an intermediate result that tells you how many times each
> customer bought each product.  But, obviously, that doesn't completely solve
> the problem.
>
>
>
> At this point, in order to solve the problem, you'd have to use a cursor or
> a CROSS APPLY.  Here's an example in T-SQL:
>
>
>
> --THE CURSOR METHOD:
>
>
>
> DECLARE @productId int;
>
> DECLARE product_cur CURSOR FOR
>
> SELECT DISTINCT product_id
>
> FROM transactions t
>
>
>
> OPEN product_cur
>
>
>
> FETCH product_cur into @productId
>
>
>
> WHILE (@@FETCH_STATUS <> -1)
>
> BEGIN
>
>
>
>    FETCH product_cur into @productId
>
>    INSERT top_customers_by_product
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
>
>
>
>
> END
>
> CLOSE Domains
>
> DEALLOCATE Domains
>
>
>
>
>
> --THE CROSS APPLY METHOD:
>
>
>
> --First create a user defined function
>
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
>
> RETURNS TABLE
>
> AS
>
> RETURN
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
> GO
>
>
>
> --Build a table of distinct product Ids
>
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
>
>
> --Run the CROSS APPLY
>
> SELECT A.product_id
>
> , A.customer_id
>
> , A.products_bought
>
> INTO top_customers_by_product
>
> FROM temp_distinct_product_ids T
>
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
>
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
> dramatically faster for anyone that cares).  How can I do the same thing in
> Hive?  Here's the question restated: How can I implement a cursor in Hive?
> How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in
> Hive?
>
>
>
>
>
> I realize that I can implement a cursor outside of Hive and just execute the
> same Hive script over and over and over again.  And, that's not a horrible
> solution as long as it leverages the full power of Hadoop.  My concern is
> that each of the individual queries that is run inside are fairly
> inexpensive, but the total number of products makes the total job *very*
> expensive.
>
>
>
> Also, the solution should be reusable -- I'd really prefer not to write a
> custom jar every time I run into this problem.
>
>
>
> Actually, I'm also not particularly religious about using Hive.  If there's
> some other tech that does what I need, that's cool too.
>
>
>
> Thanks in advance.
>
>
>
> Mike Roberts



-- 
Yours,
Zheng

Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?

Posted by Zheng Shao <zs...@gmail.com>.
We can use a combination of UDAF and LATERAL VIEW to implement what you want.

1. Define a UDAF like this: max_n(5, products_bought, customer_id)
which returns the top 5 products_bought and their customer_id in type
of array<struct<col0:int,col1:int>>
2. Use the Lateral views (with explode) to transform a single row into
multiple rows.

SELECT t.product_id, t5.products_bought, t5.customer_id
FROM (
  SELECT product_id, max_n(5, products_bought, customer_id) as top5
  FROM temp
  GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS
products_bought, customer_id;

See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
Paul is the author of UDTF and Lateral view. He might be able to give
you more details.

Zheng

On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts <mi...@spyfu.com> wrote:
>
>
> I'm trying to use Hive to solve a fairly common SQL scenario that I run
> into.  I have boiled the problem down into its most basic form:
>
>
>
> You have a table of transactions defined as so:  CREATE TABLE transactions
> (product_id INT, customer_id INT)
>
>
>
> |--------------------|
>
> |--Transactions------|
>
> |---product_id (INT)-|
>
> |---customer_id(INT)-|
>
> |--------------------|
>
>
>
>
>
> ****The goal is simple: For each product, produce a list of the top 5
> largest customers.****
>
>
>
> So, the base query would look like this:
>
>
>
> SELECT product_id, customer_id, count(*) as products_bought
>
> FROM transactions
>
> GROUP BY product_id, customer_id
>
>
>
> You could insert that value into another table called products_bought
> defined as:
>
> CREATE TABLE prod_bought
>
> (product_id INT, customer_id INT, products_bought INT)
>
>
>
> Now you have an intermediate result that tells you how many times each
> customer bought each product.  But, obviously, that doesn't completely solve
> the problem.
>
>
>
> At this point, in order to solve the problem, you'd have to use a cursor or
> a CROSS APPLY.  Here's an example in T-SQL:
>
>
>
> --THE CURSOR METHOD:
>
>
>
> DECLARE @productId int;
>
> DECLARE product_cur CURSOR FOR
>
> SELECT DISTINCT product_id
>
> FROM transactions t
>
>
>
> OPEN product_cur
>
>
>
> FETCH product_cur into @productId
>
>
>
> WHILE (@@FETCH_STATUS <> -1)
>
> BEGIN
>
>
>
>    FETCH product_cur into @productId
>
>    INSERT top_customers_by_product
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
>
>
>
>
> END
>
> CLOSE Domains
>
> DEALLOCATE Domains
>
>
>
>
>
> --THE CROSS APPLY METHOD:
>
>
>
> --First create a user defined function
>
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
>
> RETURNS TABLE
>
> AS
>
> RETURN
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
> GO
>
>
>
> --Build a table of distinct product Ids
>
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
>
>
> --Run the CROSS APPLY
>
> SELECT A.product_id
>
> , A.customer_id
>
> , A.products_bought
>
> INTO top_customers_by_product
>
> FROM temp_distinct_product_ids T
>
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
>
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
> dramatically faster for anyone that cares).  How can I do the same thing in
> Hive?  Here's the question restated: How can I implement a cursor in Hive?
> How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in
> Hive?
>
>
>
>
>
> I realize that I can implement a cursor outside of Hive and just execute the
> same Hive script over and over and over again.  And, that's not a horrible
> solution as long as it leverages the full power of Hadoop.  My concern is
> that each of the individual queries that is run inside are fairly
> inexpensive, but the total number of products makes the total job *very*
> expensive.
>
>
>
> Also, the solution should be reusable -- I'd really prefer not to write a
> custom jar every time I run into this problem.
>
>
>
> Actually, I’m also not particularly religious about using Hive.  If there’s
> some other tech that does what I need, that’s cool too.
>
>
>
> Thanks in advance.
>
>
>
> Mike Roberts



-- 
Yours,
Zheng