Posted to user@pig.apache.org by John Meek <jo...@aol.com> on 2013/03/31 18:06:28 UTC
Spreading data in Pig
Hey all,
Can anyone tell me how to solve the problem below in Pig?
I have 2 data sources:
TABLE A with a list of User IDs:
User1
User2
User3
User4
User5
User6
User7
User8
User9
TABLE B with (Host name, Capacity):
Hostb 2
Hostc 4
Hostd 3
I basically need to spread the users in Table A across the hosts in Table B, according to how much capacity each host has.
So end result should be a file:
User1 Hostb
User2 Hostb
User3 Hostc
User4 Hostc
User5 Hostc
User6 Hostc
User7 Hostd
User8 Hostd
User9 Hostd
The order does not matter as long as each host gets exactly as many users as its capacity allows. Also, SUM(TableB.Capacity) will always equal COUNT(TableA.UserID), so there won't be any extra or missing values to place.
thanks,
JM
Re: Spreading data in Pig
Posted by John Meek <jo...@aol.com>.
Thanks Jacob. That looks like it will work. I have to figure out a way to port that R function to Jython so the UDF is consistent with the rest of my script. Thanks.
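A Jython port of that sequence function could look roughly like the sketch below. The function name seq, its single capacity argument, and the schema string are assumptions made for illustration; this is not the Seq UDF mentioned in the thread. Pig is expected to supply the outputSchema decorator when a script is registered with "using jython", so the fallback definition only exists to let the file run under a plain Python interpreter for testing.

```python
# Sketch of a Jython UDF that emits the integers 0 .. capacity-1 as a bag.
# Assumption: Pig provides outputSchema when the script is registered
# 'using jython'; the fallback below is only for running outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("sequence:{(num:int)}")
def seq(capacity):
    # A Pig bag maps to a Python list of tuples.
    if capacity is None:
        return None
    return [(i,) for i in range(capacity)]
```

Assuming the file is saved as seq_udf.py (a hypothetical name), it could be registered with something like "register 'seq_udf.py' using jython as myfuncs;" and called as myfuncs.seq(capacity).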
-----Original Message-----
From: Jacob Perkins <ja...@gmail.com>
To: user <us...@pig.apache.org>
Sent: Sun, Mar 31, 2013 2:13 pm
Subject: Re: Spreading data in Pig
Re: Spreading data in Pig
Posted by Jacob Perkins <ja...@gmail.com>.
Hi John,
The only way I can think of to do this is with the RANK operator
(available as of Pig 0.11) along with a custom UDF, as follows:
* RANK the users relation to result in something like:
(User1, 1)
(User2, 2)
(User3, 3)
(User4, 4)
(User5, 5)
(User6, 6)
(User7, 7)
(User8, 8)
(User9, 9)
* Use a UDF that works much like R's "seq" function
(http://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.html) to
generate a bag of integers from 0 up to, but not including, the capacity
of a given host:
(Hostb, {(0),(1)})
(Hostc, {(0),(1),(2),(3)})
(Hostd, {(0),(1),(2)})
which can then be flattened in a projection to result in:
(Hostb, 0)
(Hostb, 1)
(Hostc, 0)
(Hostc, 1)
(Hostc, 2)
(Hostc, 3)
(Hostd, 0)
(Hostd, 1)
(Hostd, 2)
(Basically reversing any aggregation that was done to produce the
capacity count in the first place...)
* Rank the exploded set of hosts to result in:
(Hostb, 1)
(Hostb, 2)
(Hostc, 3)
(Hostc, 4)
(Hostc, 5)
(Hostc, 6)
(Hostd, 7)
(Hostd, 8)
(Hostd, 9)
* You can then join the ranked hosts and the ranked users by rank and
project out fields you don't need to result in:
(Hostb, User1)
(Hostb, User2)
(Hostc, User3)
(Hostc, User4)
(Hostc, User5)
(Hostc, User6)
(Hostd, User7)
(Hostd, User8)
(Hostd, User9)
Here's some example Pig code that works with Pig 0.11 (I already have a
Seq UDF):
************
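-- Seq is a custom UDF (not a Pig builtin); register or define it first.
users = load 'users' as (user_id:chararray);
hosts = load 'hosts' as (host_id:chararray, capacity:int);

-- One row per unit of capacity for each host.
hosts_exploded = foreach hosts {
    sequence = Seq(0, capacity, capacity);
    generate
        host_id as host_id,
        flatten(sequence) as num;
};

-- RANK prepends the rank as the first field ($0).
ranked_users = rank users;
ranked_hosts = rank hosts_exploded;

-- Join on rank and keep only the host and the user.
spread = foreach (join ranked_users by $0, ranked_hosts by $0) generate
    host_id, user_id;
dump spread;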
************
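To check the expected result without a cluster, the same four steps can be walked through in plain Python. This is only a sketch of the logic, not Pig code: the function name spread and the in-memory lists are assumptions for illustration, and enumerate stands in for RANK.

```python
def spread(users, hosts):
    """Assign users to hosts by joining two rankings on rank."""
    # Step 1: rank the users, 1-based, as RANK does.
    ranked_users = {rank: user for rank, user in enumerate(users, start=1)}
    # Step 2: explode each host into one row per unit of capacity.
    exploded = [host for host, capacity in hosts for _ in range(capacity)]
    # Step 3: rank the exploded host rows.
    ranked_hosts = {rank: host for rank, host in enumerate(exploded, start=1)}
    # Step 4: inner join on rank, keeping only host and user.
    return [(ranked_hosts[rank], ranked_users[rank])
            for rank in sorted(ranked_users) if rank in ranked_hosts]


users = ["User%d" % i for i in range(1, 10)]
hosts = [("Hostb", 2), ("Hostc", 4), ("Hostd", 3)]

for host, user in spread(users, hosts):
    print(host, user)
```

Because the join is on rank alone, each host receives exactly as many users as it has exploded rows, which is its capacity.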
Hope that helps!
--jacob
@thedatachef