Posted to user@pig.apache.org by Prashanth Pappu <pr...@conviva.com> on 2008/06/19 21:20:23 UTC

Performance/coding question

I have a PIG script that simply generates a lot of 'counts' over very large
data. For example,

a = load 'data' as (x,y,z);

b1 = filter a by x==1;
b1_group = group b1 all;
b1_count = foreach b1_group generate COUNT(b1);

b2 = filter a by y==1;
b2_group = group b2 all;
b2_count = foreach b2_group generate COUNT(b2);

...etc

Suppose that we need to generate counts b1 to b1000. PIG then generates 1000
different Hadoop jobs (one for each count). While each job finishes fast
enough, the per-job overhead considerably slows down the script. So I have
two questions:

(a) If I want to generate many counts by simply filtering the rows of the
data, is there a better way to code this script?
(b) Are there any PIG optimizations (current or planned) that will cause
PIG to generate fewer jobs? Because, clearly, one can write a single Java
map-reduce job to accomplish the task (rough sketch below).
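
For concreteness, that single job might look roughly like the following
sketch. Everything here is hypothetical: the class names, the assumption
that the input is tab-separated x,y,z text, and the use of the newer
Hadoop mapreduce API rather than the mapred API current when this thread
was written; the driver boilerplate is omitted.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CountsJob {
    // Mapper: emit one (counterName, 1) pair per test the record matches.
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] f = value.toString().split("\t");  // assumes tab-separated x,y,z
            if (f[0].equals("1")) ctx.write(new Text("b1"), ONE);  // x == 1
            if (f[1].equals("1")) ctx.write(new Text("b2"), ONE);  // y == 1
            // ... one test per counter b3 .. b1000
        }
    }

    // Reducer: sum the 1s for each counter name.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> vals, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : vals) sum += v.get();
            ctx.write(key, new LongWritable(sum));
        }
    }
}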

Thanks,
Prashanth

RE: Performance/coding question

Posted by Utkarsh Srivastava <ut...@yahoo-inc.com>.
You can write a function myFunc that outputs, for a particular record,
which of the counts b1 .. b1000 it contributes to (it could even
contribute to more than one, in which case myFunc() should be an
EvalFunc<DataBag>).

Then

A = load 'data' as (x,y,z);
B = foreach A generate flatten(myFunc(*)) as cnt;  -- one row per count this record feeds
C = group B by cnt;
D = foreach C generate group, COUNT(B);
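
A minimal sketch of what such a UDF might look like. This is
hypothetical: the class name is made up, it assumes x, y, z are loaded
as ints, and it uses the newer EvalFunc interface (Pig's UDF API at the
time of this thread used a different exec signature).

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Emits one single-field tuple per counter (b1 .. b1000) that the
// input record contributes to.
public class MyFunc extends EvalFunc<DataBag> {
    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();

    @Override
    public DataBag exec(Tuple input) throws IOException {
        DataBag out = bagFactory.newDefaultBag();
        // Two example tests shown; in practice there is one per counter.
        if (Integer.valueOf(1).equals(input.get(0)))   // x == 1 -> b1
            out.add(tupleFactory.newTuple("b1"));
        if (Integer.valueOf(1).equals(input.get(1)))   // y == 1 -> b2
            out.add(tupleFactory.newTuple("b2"));
        // ... tests for b3 .. b1000
        return out;
    }
}

The jar containing the class would be registered in the script (e.g.
register myfunc.jar; with whatever the jar is actually called) and the
function referenced by its package-qualified name.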


Utkarsh

Re: Performance/coding question

Posted by Chris Olston <ol...@yahoo-inc.com>.
Prashanth,

You can write it as a single group-by program, using a custom
function to assign tuples to groups (i.e., if x==1, it assigns the
tuple to a first group; if y==1, to a second group; and so on). If
you require a single tuple to be placed into multiple groups, the
function can output multiple groups for a single input.

It would look like this:

a = load ...;
b = foreach a generate flatten(my_group_func(*));
c = group b by $0;
d = foreach c generate group, COUNT(b);
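
(Assuming my_group_func emits group names like 'b1' .. 'b1000', d comes
out as one (name, count) tuple per group, and the whole pipeline
compiles down to a single map-reduce job instead of one job per count.)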

-Chris



--
Christopher Olston, Ph.D.
Sr. Research Scientist
Yahoo! Research