Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2012/09/26 03:36:24 UTC

finding mean and standard deviation

Hi,
   I have a huge text file of the following form (the data is saved in a
directory as data/data1.txt, data2.txt, and so on):
 merchant_id, user_id, amount
  1234, 9123, 299.2
  1233, 9199, 203.2
  1234, 0124, 230
  and so on..

What I want to do is find the average amount for each merchant,
and in the end save the output to a file, something like:
merchant_id, average_amount
 1234, avg_amt_1234
  and so on.
How do I calculate the standard deviation as well?

Sorry for asking such a basic question. :(
Any help would be appreciated. :)
Jamal

Re: finding mean and standard deviation

Posted by Cheolsoo Park <ch...@cloudera.com>.
Oh, sure.  Please find more info about UDFs (user-defined functions) here:
http://pig.apache.org/docs/r0.10.0/udf.html

Re: finding mean and standard deviation

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  Thanks for replying.
Err, I am new here.
I am trying to find out what a UDF is.


RE: finding mean and standard deviation

Posted by "Manish.Bhoge" <Ma...@target.com>.
Check the DataFu library from LinkedIn. It should have the statistical functions you are expecting to use in Pig.

https://github.com/linkedin/datafu

Thank You,
Manish.

Re: finding mean and standard deviation

Posted by Cheolsoo Park <ch...@cloudera.com>.
Hi,

in = load 'in.txt' using PigStorage(',') as (merchant:int, customer:int,
amount:float);
perMerchant = group in by merchant;
avg = foreach perMerchant generate group, AVG(in.amount);
dump avg;

This returns (merchant_id, avg of amount) as follows:

(1233,203.1999969482422)
(1234,264.6000061035156)

Regarding standard deviation, you can write your own UDF that computes it.
Please take a look at AVG.java to see how it computes the average.
Basically, you need to modify the exec() method to compute standard
deviation instead of average.
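
As a possible alternative to writing a UDF, the standard deviation can
also be sketched with built-in functions only, using the identity
variance = E[x^2] - (E[x])^2. The script below is a minimal sketch, not
a tested solution: the relation names (txns, squared, per_merchant,
moments, stats) and the paths 'data' and 'merchant_stats' are
hypothetical, and it assumes the three-column layout from the original
question. It uses STORE instead of DUMP so that the result ends up in a
file.

```pig
-- Hypothetical paths and relation names; columns follow the thread's layout.
-- Pointing LOAD at the directory reads every file inside it.
txns = LOAD 'data' USING PigStorage(',')
       AS (merchant:int, customer:int, amount:double);

-- Carry amount^2 next to amount so both moments can be averaged per group.
squared = FOREACH txns GENERATE merchant, amount, amount * amount AS amount_sq;

per_merchant = GROUP squared BY merchant;

-- mean = E[x], mean_sq = E[x^2]
moments = FOREACH per_merchant GENERATE
              group AS merchant,
              AVG(squared.amount) AS mean,
              AVG(squared.amount_sq) AS mean_sq;

-- Population standard deviation: sqrt(E[x^2] - (E[x])^2)
stats = FOREACH moments GENERATE
            merchant,
            mean AS average_amount,
            SQRT(mean_sq - mean * mean) AS stddev_amount;

STORE stats INTO 'merchant_stats' USING PigStorage(',');
```

For the two sample rows of merchant 1234 (299.2 and 230) this works out
to a mean of 264.6 and a population standard deviation of about 34.6.
Note that this is the population form (dividing by n, not n - 1), and
that subtracting two large, nearly equal averages can lose precision or
even go slightly negative through rounding, so a UDF with a more stable
algorithm may still be the better choice for large amounts.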

Thanks,
Cheolsoo
