Posted to user@pig.apache.org by Tamir Kamara <ta...@gmail.com> on 2009/03/01 15:35:01 UTC

SUM of an expression

Hi,

I'm trying to generate a sum of an expression like the following:

b = GROUP a by domain;
r = FOREACH b generate group, SUM(a.x+a.y);

This results in an error that DefaultDataBag cannot be cast to Tuple, but
both x and y are tuples (int).
This is easy to get around by generating the inner expression of the sum in
a separate line, but I wonder if this isn't something Pig should be
able to do on its own?

Thanks,
Tamir

Re: SUM of an expression

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Tamir Kamara wrote:
> Hi,
> 
> I'm trying to generate a sum of an expression like the following:
> 
> b = GROUP a by domain;
> r = FOREACH b generate group, SUM(a.x+a.y);

What you need is something like this :

b = GROUP a by domain;
r = FOREACH b {
   X = FOREACH a GENERATE x+y;
   generate group, SUM(X);
}

I don't think Pig supports this right now.

So you will need to simulate this through a UDF.

b = GROUP a by domain;
r = FOREACH b generate group, CUSTOM_SUM(a);

Within your UDF, for each tuple in the input bag (a), pick 'x' and 'y',
add them, and accumulate the results into a running total.

Note - might be a good idea to make it an algebraic function (so that 
combiners get invoked for your script above).
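
To illustrate the algebraic idea, here is a minimal sketch in plain Python. It only models the logic and does not use Pig's UDF API. The function names (initial, intermediate, final, custom_sum) and the sample bag are assumptions made for illustration. The point is that partial sums computed on separate chunks of the bag can be merged later, which is what lets a combiner do part of the work.

```python
# Plain-Python model of an algebraic CUSTOM_SUM; not Pig's UDF API.
# All names and the sample data below are hypothetical.

def initial(chunk):
    """Map side: sum x + y over one chunk of the bag."""
    return sum(t["x"] + t["y"] for t in chunk)

def intermediate(partials):
    """Combiner: merge partial sums into one partial sum."""
    return sum(partials)

def final(partials):
    """Reducer: merge the remaining partial sums into the result."""
    return sum(partials)

def custom_sum(bag):
    """Non-algebraic reference: process the whole bag in one pass."""
    return sum(t["x"] + t["y"] for t in bag)

bag = [{"x": 1, "y": 2}, {"x": 3, "y": 4}, {"x": 5, "y": 6}]

# Pretend the bag was split across two map tasks.
chunks = [bag[:2], bag[2:]]
partials = [initial(c) for c in chunks]
result = final([intermediate(partials)])
```

Because addition is associative, result is the same as custom_sum(bag) no matter how the bag is split into chunks.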

Regards,
Mridul


> 
> This results in an error that DefaultDataBag cannot be cast to Tuple, but
> both x and y are tuples (int).
> This is easy to get around by generating the inner expression of the sum in
> a separate line, but I wonder if this isn't something Pig should be
> able to do on its own?
> 
> Thanks,
> Tamir
> 


Re: SUM of an expression

Posted by Alan Gates <ga...@yahoo-inc.com>.
The issue here is the semantics of a.x and a.y.  Once you say "group  
a", then the a in "FOREACH b" is a bag.  a.x means take the bag a, and  
for each tuple project just the field x, and then put the resulting  
tuples in a bag.  So a.x is a bag of tuples with just the field x.   
Pig doesn't know how to add two bags.  So, if you change this to:

a1 = foreach a generate domain, x + y as xy;
b = group a1 by domain;
r = foreach b generate group, SUM(a1.xy);

then the right things should happen.
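
As a rough model of why this works, here is the same pipeline in plain Python. It is only a sketch of the semantics, not Pig code, and the sample rows are made up. The addition happens per tuple, before grouping, so the aggregate only ever sees a bag of single values.

```python
from collections import defaultdict

# Hypothetical sample rows for relation a: (domain, x, y)
a = [("foo.com", 1, 2), ("foo.com", 3, 4), ("bar.org", 5, 6)]

# Step 1: compute x + y per tuple, keeping the grouping key.
a1 = [(domain, x + y) for domain, x, y in a]

# Step 2: group by domain; each group holds a bag (list) of tuples.
b = defaultdict(list)
for domain, xy in a1:
    b[domain].append((domain, xy))

# Step 3: project xy out of each bag and sum it per group.
r = {group: sum(xy for _, xy in bag) for group, bag in b.items()}
```

With these rows, r maps foo.com to 10 and bar.org to 11.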

Alan.

On Mar 1, 2009, at 6:35 AM, Tamir Kamara wrote:

> Hi,
>
> I'm trying to generate a sum of an expression like the following:
>
> b = GROUP a by domain;
> r = FOREACH b generate group, SUM(a.x+a.y);
>
> This results in an error that DefaultDataBag cannot be cast to Tuple, but
> both x and y are tuples (int).
> This is easy to get around by generating the inner expression of the sum in
> a separate line, but I wonder if this isn't something Pig should be
> able to do on its own?
>
> Thanks,
> Tamir