Posted to user@pig.apache.org by jamal sasha <ja...@gmail.com> on 2013/04/01 23:06:43 UTC

Join question

Hi,
  I have a simple join question.
base = LOAD 'input1' USING PigStorage(',') AS (id1, field1, field2);
stats = LOAD 'input2' USING PigStorage(',') AS (id1, mean, median);
joined = JOIN base BY id1, stats BY id1;
final = FOREACH joined GENERATE base::id1, base::field1, base::field2,
    stats::mean, stats::median;
STORE final INTO 'output' USING PigStorage(',');

But something doesn't feel right.
The inputs are on the order of MBs, whereas the output is around 100 GB.

I tried it on sample files
where base is 35 MB
and stats is 10 MB,
and the output explodes to GBs.
What am I missing?

Re: Join question

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.
I am not sure if I understand you correctly, but you seem to want to find
the average per id. For that all you need to do is group by id, and then
take the avg for every group. You don't need to count anything.
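
A minimal sketch of the group-then-average approach described above, assuming the two-column "id,value" data from the question is stored without a header row. The path 'input', the output path 'output_avg', and the relation names are placeholders chosen for illustration, not names from the thread.

```pig
-- Hypothetical input path; columns follow the "id,value" sample in the question.
data = LOAD 'input' USING PigStorage(',') AS (id:int, value:double);

-- Collect all rows that share an id into one bag per id.
grouped = GROUP data BY id;

-- AVG runs over each group's bag, so the denominator is the per-id row count,
-- not the row count of the whole input.
averages = FOREACH grouped GENERATE group AS id, AVG(data.value) AS mean;

STORE averages INTO 'output_avg' USING PigStorage(',');
```

For the sample rows 1,1.0 / 1,3.0 / 1,5.0 / 2,1.0 this should give 3.0 for id 1 and 1.0 for id 2. No separate COUNT step is needed because AVG is applied per group.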

Re: Join question

Posted by jamal sasha <ja...@gmail.com>.
Hi,
  Yeah, there was a bug in my "stats" data.
I was wondering how I can calculate an average in Pig,
something like:
http://stackoverflow.com/questions/12593527/finding-mean-using-pig-or-hadoop

But in the top response, it seems the user wanted to calculate the
average across all the data,
as

count = COUNT(inpt)
and inpt is the complete input,
whereas what I want is for the denominator to be the count for each id.

so my data is like:

id, value
1,1.0
1,3.0
1,5.0
2,1.0

So, the average I am expecting is:

1,3.0
2,1.0

as (1 + 3 + 5) / 3 = 3,
whereas in the example, COUNT(inpt) would give me 4?

How do I achieve this?
Thanks

Re: Join question

Posted by Mehmet Tepedelenlioglu <me...@yahoo.com>.
Are your ids unique?
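
The reason uniqueness matters: in an inner join, an id that appears m times in base and n times in stats produces m * n output rows, so repeated ids on both sides can blow up the output far beyond the input size. A rough sketch of one way to check this, reusing the load statement from the original post; the relation names by_id, counts, and dups are made up for illustration.

```pig
-- Same load as in the original post; 'input2' is the stats file.
stats = LOAD 'input2' USING PigStorage(',') AS (id1, mean, median);

-- Count how many rows each join key has.
by_id = GROUP stats BY id1;
counts = FOREACH by_id GENERATE group AS id1, COUNT(stats) AS n;

-- Keep only keys that occur more than once; these multiply rows in the join.
dups = FILTER counts BY n > 1;
DUMP dups;
```

If dups comes back empty, each id occurs at most once in stats, and the same check can be run against base.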
