You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by John Meek <jo...@aol.com> on 2013/03/31 18:06:28 UTC

Spreading data in Pig

hey all,

Can anyone let me know how I can accomplish below problem in Pig?

I have 2 data sources:

TABLE A with a list of User IDs:

User1
User2
User3
User4
User5
User6
User7
User8
User9

TABLE B with (Host name, Capacity):

Hostb 2
Hostc 4
Hostd 3


I basically need to spread the data in table A based on Table B based on how much capacity Table B has.

So end result should be a file:

User1 Hostb
User2 Hostb
User3 Hostc
User4 Hostc
User5 Hostc
User6 Hostc
User7 Hostd
User8 Hostd
User9 Hostd

The order does not matter as long as each Host gets the capacity it can take. Also the SUM(TableB.Capacity) will always be COUNT(TableA.UserID) so there wont be any extra or less values to plug in.


thanks,
JM

Re: Spreading data in Pig

Posted by John Meek <jo...@aol.com>.

Thanks Jacob. That looks like it will work. I got to figure out a way to transpose that R function in jython to make a udf consistent with the rest of my script .Thanks.
 

 

 

-----Original Message-----
From: Jacob Perkins <ja...@gmail.com>
To: user <us...@pig.apache.org>
Sent: Sun, Mar 31, 2013 2:13 pm
Subject: Re: Spreading data in Pig


Hi John,

The only way I can think of to do this is using the RANK operator
(available only in pig version 0.11) along with a custom udf as follows:

* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a udf that functions much like the rstats "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) that
generates a bag containing integers from 0 up to the capacity of a given
host:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users by rank and
project out fields you don't need to result in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)

Here's some example pig code that I used that works with pig 0.11 (I
already have a Seq udf):

************

users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate
host_id, user_id;

dump spread;

************


Hope that helps!

--jacob
@thedatachef

On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
> 
> Can anyone let me know how I can accomplish below problem in Pig?
> 
> I have 2 data sources:
> 
> TABLE A with a list of User IDs:
> 
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
> 
> TABLE B with (Host name, Capacity):
> 
> Hostb 2
> Hostc 4
> Hostd 3
> 
> 
> I basically need to spread the data in table A based on Table B based on how 
much capacity Table B has.
> 
> So end result should be a file:
> 
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
> 
> The order does not matter as long as each Host gets the capacity it can take. 
Also the SUM(TableB.Capacity) will always be COUNT(TableA.UserID) so there wont 
be any extra or less values to plug in.
> 
> 
> thanks,
> JM
> 
>

Re: Spreading data in Pig

Posted by Jacob Perkins <ja...@gmail.com>.

Hi John,

The only way I can think of to do this is using the RANK operator
(available only in pig version 0.11) along with a custom udf as follows:

* RANK the users relation to result in something like:

(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)

* Use a udf that functions much like the rstats "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) that
generates a bag containing integers from 0 up to the capacity of a given
host:

(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})

which can then be flattened in a projection to result in:

(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)

(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)

* Rank the exploded set of hosts to result in:

(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)

* You can then join the ranked hosts and the ranked users by rank and
project out fields you don't need to result in:

(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)

Here's some example pig code that I used that works with pig 0.11 (I
already have a Seq udf):

************

users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

hosts_exploded = foreach hosts {
                   sequence = Seq(0, capacity, capacity);
                   generate
                     host_id           as host_id,
                     flatten(sequence) as num;
                 };

ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate
host_id, user_id;

dump spread;

************


Hope that helps!

--jacob
@thedatachef

On Sun, 2013-03-31 at 12:06 -0400, John Meek wrote:
> hey all,
> 
> Can anyone let me know how I can accomplish below problem in Pig?
> 
> I have 2 data sources:
> 
> TABLE A with a list of User IDs:
> 
> User1
> User2
> User3
> User4
> User5
> User6
> User7
> User8
> User9
> 
> TABLE B with (Host name, Capacity):
> 
> Hostb 2
> Hostc 4
> Hostd 3
> 
> 
> I basically need to spread the data in table A based on Table B based on how much capacity Table B has.
> 
> So end result should be a file:
> 
> User1 Hostb
> User2 Hostb
> User3 Hostc
> User4 Hostc
> User5 Hostc
> User6 Hostc
> User7 Hostd
> User8 Hostd
> User9 Hostd
> 
> The order does not matter as long as each Host gets the capacity it can take. Also the SUM(TableB.Capacity) will always be COUNT(TableA.UserID) so there wont be any extra or less values to plug in.
> 
> 
> thanks,
> JM
> 
>