You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Adamantios Corais <ad...@gmail.com> on 2014/03/20 13:59:45 UTC

What's the equivalent of a GROUP BY statement within a FOREACH statement?

Hi,

I have the following schema: raw3: {group: (field1: int,field2: 
chararray,field3: chararray),raw2: {(field1: int,field2: 
chararray,field4: chararray,field3: chararray,field5: boolean)}}

What I am trying to do is extract the most frequent value of field4 and 
the most frequent of field field5 along with the group fields (field1, 
field2, field3).

I know that GROUP BY is not allowed (yet) with FOREACH statements. How 
can I accomplice the same functionality without writing a UDF?

Example:

input: ((1,2,3), 
{(1,2,a,3,x),(1,2,b,3,x),(1,2,a,3,x),(1,2,v,3,x),(1,2,f,3,z),(1,2,a,3,z)})

output: (1,2,a,3,x)

Thank you,
Adam.

Re: What's the equivalent of a GROUP BY statement within a FOREACH statement?

Posted by Jacob Perkins <ja...@gmail.com>.

Adam,

Take a look at the CountEach udf in the datafu library (http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/CountEach.html). Eg:


res = foreach raw3 {
        field4bag = foreach raw2 generate field4;
        field5bag = foreach raw2 generate field5;
        field4cnts = CountEach(field4bag);
        field5cnts = CountEach(field5bag);

        field4max = TOP(1, 1, field4cnts);
        field5max = TOP(1, 1, field5cnts);
        generate
          flatten(group) as (field1, field2, field3),
          flatten(field4max.tuple_schema.$0) as field4max,
          flatten(field5max.tuple_schema.$0) as field5max;
      };

Generates: (1,2,3,(a),(x)) for your input. You can do further projections to rearrange how you like downstream.

Best of luck.

--jacob
@thedatachef


On Mar 20, 2014, at 5:59 AM, Adamantios Corais <ad...@gmail.com> wrote:

> Hi,
> 
> I have the following schema: raw3: {group: (field1: int,field2: chararray,field3: chararray),raw2: {(field1: int,field2: chararray,field4: chararray,field3: chararray,field5: boolean)}}
> 
> What I am trying to do is extract the most frequent value of field4 and the most frequent of field field5 along with the group fields (field1, field2, field3).
> 
> I know that GROUP BY is not allowed (yet) with FOREACH statements. How can I accomplice the same functionality without writing a UDF?
> 
> Example:
> 
> input: ((1,2,3), {(1,2,a,3,x),(1,2,b,3,x),(1,2,a,3,x),(1,2,v,3,x),(1,2,f,3,z),(1,2,a,3,z)})
> 
> output: (1,2,a,3,x)
> 
> Thank you,
> Adam.