Posted to user@pig.apache.org by praveenesh kumar <pr...@gmail.com> on 2012/01/16 06:47:12 UTC

Pig job is taking more time than Java M/R

Hey Guys,

Is there any way to see the M/R jobs that Pig runs internally for a
given Pig script?
I wanted to get unique values for a particular column.

For that I wrote the following script:

Data = Load 'Data.csv' using PigStorage(',');
IDs = FOREACH Data GENERATE $0;
UniqueID = Distinct IDs;
Dump UniqueID;

Is this the right/best way to get unique values of a particular column?

The reason I am asking is that I ran the above script on my cluster and it
took around 30 minutes to finish.
However, when I wrote traditional Java M/R code for the same thing, it
took only 10 minutes.

So I want to see what Pig is doing internally.
Can anyone tell me what could be the reason for this behaviour? How can I
decrease the Pig execution time?

Thanks,
Praveenesh

Re: Pig job is taking more time than Java M/R

Posted by praveenesh kumar <pr...@gmail.com>.

Using PARALLEL, I can reduce the total number of reducers and thereby the
execution time.

Thanks,
Praveenesh

On Mon, Jan 16, 2012 at 12:43 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:

> That explains it. Pig computes the number of reducers using a heuristic
> based on input dataset size (1 reducer per GB). You will generally want to
> set PARALLEL explicitly when little data is forwarded to the reducers.
>
> Please take a look at the PARALLEL syntax:
> http://pig.apache.org/docs/r0.9.1/basic.html#distinct
>
> Thanks,
> Prashant

Re: Pig job is taking more time than Java M/R

Posted by Prashant Kommireddi <pr...@gmail.com>.

That explains it. Pig computes the number of reducers using a heuristic
based on input dataset size (1 reducer per GB). You will generally want to
set PARALLEL explicitly when little data is forwarded to the reducers.

Please take a look at the PARALLEL syntax:
http://pig.apache.org/docs/r0.9.1/basic.html#distinct
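
For the script in this thread, a minimal sketch of the two usual ways to
set the reducer count might look like the following. The value 40 is only
an assumption, borrowed from the reducer count reported for the Java M/R
job in this thread; the right number depends on the data and the cluster.

```pig
-- Option 1: set the reducer count for a single operator with PARALLEL.
-- 40 is an assumed value (the Java M/R job's reducer count), not a recommendation.
Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;
UniqueID = DISTINCT IDs PARALLEL 40;
DUMP UniqueID;

-- Option 2: set a script-wide default for every reduce-side operator
-- that has no PARALLEL clause of its own. Place it at the top of the script.
-- SET default_parallel 40;
```

PARALLEL only affects operators that start a reduce phase (DISTINCT, GROUP,
JOIN, ORDER and so on); it does not change the number of map tasks.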

Thanks,
Prashant

On Sun, Jan 15, 2012 at 10:59 PM, praveenesh kumar <pr...@gmail.com> wrote:

> I am using Apache Pig version 0.11.0-SNAPSHOT (r1225753), built from trunk,
> and Hadoop 0.20.205.
> Nothing else was running on the cluster at that time, and there was no
> waiting for map-reduce slots.
> The only difference I saw was that my Java M/R job ran only 40 reducers,
> whereas my Pig job ran 457 reducers. I guess the slowdown may be because of
> so many reducers running.
> Can I control the number of reducers?
>
> Thanks,
> Praveenesh

Re: Pig job is taking more time than Java M/R

Posted by praveenesh kumar <pr...@gmail.com>.

I am using Apache Pig version 0.11.0-SNAPSHOT (r1225753), built from trunk,
and Hadoop 0.20.205.
Nothing else was running on the cluster at that time, and there was no
waiting for map-reduce slots.
The only difference I saw was that my Java M/R job ran only 40 reducers,
whereas my Pig job ran 457 reducers. I guess the slowdown may be because of
so many reducers running.
Can I control the number of reducers?

Thanks,
Praveenesh


On Mon, Jan 16, 2012 at 11:42 AM, Prashant Kommireddi
<pr...@gmail.com> wrote:

> Hi Praveenesh,
>
> You can use 'EXPLAIN' to understand what Pig is doing under the hood (MR
> plan)
> http://pig.apache.org/docs/r0.9.1/test.html#explain
>
> What version of Pig and Hadoop are you using? I have never seen such a huge
> difference between Java MR and Pig. At the time you ran Pig, was the
> cluster idle or did you have other jobs running at the same time? Did you
> make sure the job was not waiting on Map or Reduce slots being made
> available?
>
> Thanks,
> Prashant

Re: Pig job is taking more time than Java M/R

Posted by Prashant Kommireddi <pr...@gmail.com>.

Hi Praveenesh,

You can use 'EXPLAIN' to understand what Pig is doing under the hood (MR
plan)
http://pig.apache.org/docs/r0.9.1/test.html#explain
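
As a sketch, EXPLAIN can be applied to the final alias of the script from
the question, either in the Grunt shell or at the end of the script in
place of DUMP:

```pig
Data = LOAD 'Data.csv' USING PigStorage(',');
IDs = FOREACH Data GENERATE $0;
UniqueID = DISTINCT IDs;
-- Print the logical, physical, and MapReduce plans for UniqueID
-- without running the job.
EXPLAIN UniqueID;
```

The MapReduce plan section shows how many M/R jobs the script compiles to
and which operators run on the map side and on the reduce side.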

What version of Pig and Hadoop are you using? I have never seen such a huge
difference between Java MR and Pig. At the time you ran Pig, was the
cluster idle or did you have other jobs running at the same time? Did you
make sure the job was not waiting on Map or Reduce slots being made
available?

Thanks,
Prashant

On Sun, Jan 15, 2012 at 9:47 PM, praveenesh kumar <pr...@gmail.com> wrote:

> Hey Guys,
>
> Is there any way to see the M/R jobs that Pig runs internally for a
> given Pig script?
> I wanted to get unique values for a particular column.
>
> For that I wrote the following script:
>
> Data = Load 'Data.csv' using PigStorage(',');
> IDs = FOREACH Data GENERATE $0;
> UniqueID = Distinct IDs;
> Dump UniqueID;
>
> Is this the right/best way to get unique values of a particular column?
>
> The reason I am asking is that I ran the above script on my cluster and it
> took around 30 minutes to finish.
> However, when I wrote traditional Java M/R code for the same thing, it
> took only 10 minutes.
>
> So I want to see what Pig is doing internally.
> Can anyone tell me what could be the reason for this behaviour? How can I
> decrease the Pig execution time?
>
> Thanks,
> Praveenesh
>