Posted to user@pig.apache.org by James Newhaven <ja...@gmail.com> on 2012/07/04 19:37:53 UTC

Using average function is really slow

Hi,

I am using the built-in org.apache.pig.builtin.AVG function. I have a set
of 100,000 items that I want to average.

The relevant Pig Latin is below:


L = FOREACH K GENERATE AVG(I.productcost), AVG(I.deliverycost);
STORE L INTO 'output' USING PigStorage (',');


In the Hadoop Admin Console, I can see several jobs that finish quickly (I
can see they all use many map and reduce tasks).

However, eventually Hadoop executes a job with a single map and reduce task
which is taking forever to finish (it has been running for several hours so
far). All the map and reduce tasks report 100% complete, but I can see that
one of the statistics called "Map output records" is slowly increasing and
the job status remains as 'Running'.

Could anyone provide any advice on how I could go about diagnosing the
cause of this problem? I suspect the average function is taking a long time
to execute, but I thought calculating the average of 100,000 items would
not take that long.

Thanks,
James

Re: Using average function is really slow

Posted by Ruslan Al-Fakikh <me...@gmail.com>.

Hi James,

AVG is Algebraic, which means that it will use the combiner when it can. It
seems that your job is not using the combiner. Can you post the full
script? Also check the job config of the running job. If it is using
the combiner, you should see something like
pig.job.feature=GROUP_BY,COMBINER
pig.alias=L (that would mean that the job really is running the
statement you gave, not one of the other statements)
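
To illustrate why an algebraic function can use a combiner, here is a
minimal Python sketch. It is not Pig's implementation; the function names
(initial, combine, final, chunked_average) are hypothetical and only
loosely mirror the initial/intermediate/final stages of an algebraic
function. The idea is that each map task reduces its records to a small
(sum, count) pair, so the reducer merges a handful of pairs instead of
receiving every record.

```python
from typing import Iterable, List, Tuple

# A partial result: (running sum, running count). Hypothetical type name.
Partial = Tuple[float, int]


def initial(value: float) -> Partial:
    """Turn a single input value into a partial result."""
    return (value, 1)


def combine(partials: Iterable[Partial]) -> Partial:
    """Merge partial results; this is the work a combiner could do map-side."""
    total, count = 0.0, 0
    for partial_sum, partial_count in partials:
        total += partial_sum
        count += partial_count
    return (total, count)


def final(partials: Iterable[Partial]) -> float:
    """Merge the remaining partials and produce the average."""
    total, count = combine(partials)
    if count == 0:
        raise ValueError("cannot average an empty input")
    return total / count


def chunked_average(values: List[float], chunk_size: int) -> float:
    """Average values by combining each chunk first, as separate map tasks might."""
    partials = [
        combine(initial(v) for v in values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return final(partials)
```

Without the combiner step, every individual value would have to be sent
to the single reducer, which is consistent with the slowly growing
"Map output records" counter described in the question.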

Ruslan
