
Posted to user@pig.apache.org by Scott <sk...@weather.com> on 2010/01/21 14:38:25 UTC

Grouping on a parameter that can have multiple values

I have a question on how to handle data that I would usually store in an 
array, or into a normalized child table in a database.  The input data 
is a set of key/value pairs where one key can be associated with 
multiple values (0 to n).

Here is a sample dataset, with bucket being the multi-valued key:

family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32
family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32,bucket=54
family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32
family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32

What I am trying to calculate is a group count on 
family,channel,timeframe and bucket, where the results would be:

(sports,baseball,today,12),2
(sports,baseball,today,27),2
(sports,baseball,today,32),2
(sports,baseball,today,54),1
(events,outdoor,weekend,13),2
(events,outdoor,weekend,27),2
(events,outdoor,weekend,32),2

One approach would be to store the bucket values in a separate 
relation and join on a surrogate key created when reading the data 
in.  Something like:

A = (12345,sports,baseball,today,M)
B = (32,12345)(27,12345)(12,12345)

C = JOIN A BY $0, B BY $1;

D = GROUP C BY (family,channel,timeframe,bucket);

I am sure this method would work, but it requires generating a 
map/reduce-friendly surrogate key on which to join the data. Is there a 
more direct way to do this in Pig?  Also, is it possible to load more 
than one relation at a time (split the data between two relations) with 
the LOAD statement?

Thanks,
Scott

Re: Grouping on a parameter that can have multiple values

Posted by Jeff Zhang <zj...@gmail.com>.

Hi Scott,

It seems that the number of buckets per record is arbitrary, so I suggest
you write your own LoadFunc that loads the buckets into a DataBag. Here is
the Pig script:

A = LOAD 'input' USING YourLoadFunc() AS
    (family,channel,timeframe,gender,b:{t:(bucket)});
B = FOREACH A GENERATE family,channel,timeframe,gender, FLATTEN(b) AS bucket;
C = GROUP B BY (family,channel,timeframe,gender,bucket);
D = FOREACH C GENERATE group, COUNT($1);
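
To sanity-check the logic without a cluster, here is a rough plain-Python sketch of what the script above is meant to compute. It is not Pig and not a LoadFunc; the function names parse_line and count_by_bucket are made up for illustration. It assumes every record is a single comma-separated line of key=value pairs, as in the sample data, and it leaves gender out of the group key so the result matches the output Scott asked for (the script above keeps gender in the key).

```python
from collections import Counter

def parse_line(line):
    """Split one 'key=value,...' record into scalar fields and a list of buckets."""
    fields, buckets = {}, []
    for pair in line.strip().split(","):
        key, _, value = pair.partition("=")
        if key == "bucket":
            buckets.append(value)   # repeated key: collect the values, like a bag
        else:
            fields[key] = value
    return fields, buckets

def count_by_bucket(lines):
    """Emit one row per bucket value, then count rows per group key."""
    counts = Counter()
    for line in lines:
        fields, buckets = parse_line(line)
        for bucket in buckets:      # plays the role of FLATTEN(b)
            key = (fields["family"], fields["channel"],
                   fields["timeframe"], bucket)
            counts[key] += 1        # plays the role of GROUP ... COUNT
    return counts

SAMPLE = [
    "family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32",
    "family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32,bucket=54",
    "family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32",
    "family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32",
]

if __name__ == "__main__":
    for key, n in count_by_bucket(SAMPLE).items():
        print(key, n)
```

In this sketch a record with no bucket values contributes no rows at all, so it never shows up in the counts.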

Hope this helps.





-- 
Best Regards

Jeff Zhang