You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Mike Roberts <mi...@spyfu.com> on 2010/01/26 07:50:39 UTC

How can I implement a cursor in Pig? ...or... Can I implement a CROSS APPLY in Pig?...or... How can I do a FOR or WHILE loop (inside or outside) of Pig?

I'm trying to use Pig to solve a fairly common SQL scenario that I run into.  I have boiled the problem down into its most basic form:

You have a table of transactions defined as so:  CREATE TABLE transactions (product_id INT, customer_id INT)

|--------------------|
|--Transactions------|
|---product_id (INT)-|
|---customer_id(INT)-|
|--------------------|


****The goal is simple: For each product, produce a list of the top 5 largest customers.****

So, the base query would look like this:

SELECT product_id, customer_id, count(*) as products_bought
FROM transactions
GROUP BY product_id, customer_id

You could insert that value into another table called products_bought defined as:
CREATE TABLE prod_bought
(product_id INT, customer_id INT, products_bought INT)

Now you have an intermediate result that tells you how many times each customer bought each product.  But, obviously, that doesn't completely solve the problem.

At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY.  Here's an example in T-SQL:

--THE CURSOR METHOD:

DECLARE @productId int;
DECLARE product_cur CURSOR FOR
SELECT DISTINCT product_id
FROM transactions t

OPEN product_cur

FETCH product_cur into @productId

WHILE (@@FETCH_STATUS <> -1)
BEGIN

   FETCH product_cur into @productId
   INSERT top_customers_by_product
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc


END
CLOSE Domains
DEALLOCATE Domains


--THE CROSS APPLY METHOD:

--First create a user defined function
CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
RETURNS TABLE
AS
RETURN
                SELECT TOP 5 product_id, customer_id, products_bought
                FROM prod_bought
                WHERE product_id = @productId
                ORDER BY products_bought desc
GO

--Build a table of distinct product Ids
SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions

--Run the CROSS APPLY
SELECT A.product_id
, A.customer_id
, A.products_bought
INTO top_customers_by_product
FROM temp_distinct_product_ids T
CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A


Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares).  How can I do the same thing in Pig?  Here's the question restated: How can I implement a cursor in Pig?  How can I do a for or while loop in Pig?  Can I implement a CROSS APPLY in Pig?


I realize that I can implement a cursor outside of Pig and just execute the same Pig script over and over and over again.  And, that's not a horrible solution as long as it leverages the full power of Hadoop.  My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive.

Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem.

Actually, I'm also not particularly religious about using Pig.  If there's some other tech that does what I need, that's cool too.

Thanks in advance.

Mike Roberts


Re: How can I implement a cursor in Pig? ...or... Can I implement a CROSS APPLY in Pig?...or... How can I do a FOR or WHILE loop (inside or outside) of Pig?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Mike,

Pig has a FOREACH operator that is somewhat analogous to opening a
cursor on a relation.

So you can do something like:

transactions = LOAD '/foo' ...;
by_prod = GROUP data BY product_id;
top_prod_users = FOREACH by_prod GENERATE group as product_id, Top(5,
1, transactions) as top5;
STORE top_prod_users into ...

You would have to write the Top UDF (or wait till I wrap it up, it so
happens I am working on it right now). Alternately, you can also do
this:

top_prod_users = FOREACH by_prod {
  ordered = ORDER transactions BY customer_id DESC;
  top5 = LIMIT ordered 5;
   GENERATE group as product_id, top5;
}

If you are used to SQL, the return of this may be a little weird --
the top 5 users will be a bag (essentially, an unordered array in a
single column).  You can use the FLATTEN operator to get that into a
familiar row-by-row representation.

-D



On Mon, Jan 25, 2010 at 10:50 PM, Mike Roberts <mi...@spyfu.com> wrote:
> I'm trying to use Pig to solve a fairly common SQL scenario that I run into.  I have boiled the problem down into its most basic form:
>
> You have a table of transactions defined as so:  CREATE TABLE transactions (product_id INT, customer_id INT)
>
> |--------------------|
> |--Transactions------|
> |---product_id (INT)-|
> |---customer_id(INT)-|
> |--------------------|
>
>
> ****The goal is simple: For each product, produce a list of the top 5 largest customers.****
>
> So, the base query would look like this:
>
> SELECT product_id, customer_id, count(*) as products_bought
> FROM transactions
> GROUP BY product_id, customer_id
>
> You could insert that value into another table called products_bought defined as:
> CREATE TABLE prod_bought
> (product_id INT, customer_id INT, products_bought INT)
>
> Now you have an intermediate result that tells you how many times each customer bought each product.  But, obviously, that doesn't completely solve the problem.
>
> At this point, in order to solve the problem, you'd have to use a cursor or a CROSS APPLY.  Here's an example in T-SQL:
>
> --THE CURSOR METHOD:
>
> DECLARE @productId int;
> DECLARE product_cur CURSOR FOR
> SELECT DISTINCT product_id
> FROM transactions t
>
> OPEN product_cur
>
> FETCH product_cur into @productId
>
> WHILE (@@FETCH_STATUS <> -1)
> BEGIN
>
>   FETCH product_cur into @productId
>   INSERT top_customers_by_product
>                SELECT TOP 5 product_id, customer_id, products_bought
>                FROM prod_bought
>                WHERE product_id = @productId
>                ORDER BY products_bought desc
>
>
> END
> CLOSE Domains
> DEALLOCATE Domains
>
>
> --THE CROSS APPLY METHOD:
>
> --First create a user defined function
> CREATE FUNCTION dbo.fn_GetTopXCustomers(@ProductId INT)
> RETURNS TABLE
> AS
> RETURN
>                SELECT TOP 5 product_id, customer_id, products_bought
>                FROM prod_bought
>                WHERE product_id = @productId
>                ORDER BY products_bought desc
> GO
>
> --Build a table of distinct product Ids
> SELECT DISTINCT product_id INTO temp_distinct_product_ids FROM transactions
>
> --Run the CROSS APPLY
> SELECT A.product_id
> , A.customer_id
> , A.products_bought
> INTO top_customers_by_product
> FROM temp_distinct_product_ids T
> CROSS APPLY dbo.fn_GetTopXCustomers(T.product_id) A
>
>
> Okay, so there are two ways I could solve the problem in SQL (CROSS APPLY is dramatically faster for anyone that cares).  How can I do the same thing in Pig?  Here's the question restated: How can I implement a cursor in Pig?  How can I do a for or while loop in Pig?  Can I implement a CROSS APPLY in Pig?
>
>
> I realize that I can implement a cursor outside of Pig and just execute the same Pig script over and over and over again.  And, that's not a horrible solution as long as it leverages the full power of Hadoop.  My concern is that each of the individual queries that is run inside are fairly inexpensive, but the total number of products makes the total job *very* expensive.
>
> Also, the solution should be reusable -- I'd really prefer not to write a custom jar every time I run into this problem.
>
> Actually, I'm also not particularly religious about using Pig.  If there's some other tech that does what I need, that's cool too.
>
> Thanks in advance.
>
> Mike Roberts
>
>