You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mike Roberts <mi...@spyfu.com> on 2010/01/26 19:16:16 UTC

Hive UDAF Project Posted on Rentacoder

I need that max_n UDAF developed that Zheng suggests below.  I have it open for bidding on Rentacoder here:  http://www.rentacoder.com/RentACoder/misc/BidRequests/ShowBidRequest.asp?lngBidRequestId=1335150

I'll post the working code when I get it.

--Mike

-----Original Message-----
From: Zheng Shao [mailto:zshao9@gmail.com] 
Sent: Tuesday, January 26, 2010 12:03 AM
To: hive-user@hadoop.apache.org; Paul Yang
Subject: Re: How can I implement a cursor in Hive? ...or... Can I implement a CROSS APPLY in Hive?...or... How can I do a FOR or WHILE loop (inside or outside) of Hive?

We can use a combination of UDAF and LATERAL VIEW to implement what you want.

1. Define a UDAF like this: max_n(5, products_bought, customer_id)
which returns the top 5 products_bought and their customer_id in type
of array<struct<col0:int,col1:int>>
2. Use the Lateral views (with explode) to transform a single row into
multiple rows.

SELECT t.product_id, t5.products_bought, t5.customer_id
FROM (
  SELECT product_id, max_n(5, products_bought, customer_id) as top5
  FROM temp
  GROUP BY product_id) t LATERAL VIEW explode(t.top5) t5 AS
products_bought, customer_id;

See http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView
Paul is the author of UDTF and Lateral view. He might be able to give
you more details.

Zheng

On Mon, Jan 25, 2010 at 10:47 PM, Mike Roberts <mi...@spyfu.com> wrote:
>
>
> I'm trying to use Hive to solve a fairly common SQL scenario that I run
> into.  I have boiled the problem down into its most basic form:
>
>
>
> You have a table of transactions defined as so:  CREATE TABLE transactions
> (product_id INT, customer_id INT)
>
>
>
> |--------------------|
>
> |--Transactions------|
>
> |---product_id (INT)-|
>
> |---customer_id(INT)-|
>
> |--------------------|
>
>
>
>
>
> ****The goal is simple: For each product, produce a list of the top 5
> largest customers.****
>
>
>
> So, the base query would look like this:
>
>
>
> SELECT product_id, customer_id, count(*) as products_bought
>
> FROM transactions
>
> GROUP BY product_id, customer_id
>
>
>
> You could insert that value into another table called products_bought
> defined as:
>
> CREATE TABLE prod_bought
>
> (product_id INT, customer_id INT, products_bought INT)
>
>
>
> Now you have an intermediate result that tells you how many times each
> customer bought each product.  But, obviously, that doesn't completely solve
> the problem.
>
>
>
> At this point, in order to solve the problem, you'd have to use a cursor or
> a CROSS APPLY.  Here's an example in T-SQL:
>
>
>
> --THE CURSOR METHOD:
>
>
>
> DECLARE @productId int;
>
> DECLARE product_cur CURSOR FOR
>
> SELECT DISTINCT product_id
>
> FROM transactions t
>
>
>
> OPEN product_cur
>
>
>
> FETCH product_cur into @productId
>
>
>
> WHILE (@@FETCH_STATUS <> -1)
>
> BEGIN
>
>
>
>    FETCH product_cur into @productId
>
>    INSERT top_customers_by_product
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
>
>
>
>
> END
>
> CLOSE Domains
>
> DEALLOCATE Domains
>
>
>
>
>
> --THE CROSS APPLY METHOD:
>
>
>
> --First create a user defined function
>
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
>
> RETURNS TABLE
>
> AS
>
> RETURN
>
>                 SELECT TOP 5 product_id, customer_id, products_bought
>
>                 FROM prod_bought
>
>                 WHERE product_id = @productId
>
>                 ORDER BY products_bought desc
>
> GO
>
>
>
> --Build a table of distinct product Ids
>
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
>
>
> --Run the CROSS APPLY
>
> SELECT A.product_id
>
> , A.customer_id
>
> , A.products_bought
>
> INTO top_customers_by_product
>
> FROM temp_distinct_product_ids T
>
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
>
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is
> dramatically faster for anyone that cares).  How can I do the same thing in
> Hive?  Here's the question restated: How can I implement a cursor in Hive?
> How can I do a for or while loop in Hive?  Can I implement a CROSS APPLY in
> Hive?
>
>
>
>
>
> I realize that I can implement a cursor outside of Hive and just execute the
> same Hive script over and over and over again.  And, that's not a horrible
> solution as long as it leverages the full power of Hadoop.  My concern is
> that each of the individual queries that is run inside are fairly
> inexpensive, but the total number of products makes the total job *very*
> expensive.
>
>
>
> Also, the solution should be reusable -- I'd really prefer not to write a
> custom jar every time I run into this problem.
>
>
>
> Actually, I'm also not particularly religious about using Hive.  If there's
> some other tech that does what I need, that's cool too.
>
>
>
> Thanks in advance.
>
>
>
> Mike Roberts



-- 
Yours,
Zheng