Posted to user@pig.apache.org by Rajesh Balamohan <ra...@gmail.com> on 2011/09/12 07:13:02 UTC

LIMIT optimization

I have a large data set (> 2 TB) and I tried scanning 100 records from it.

a = load '/usr/largedata/' using PigStorage(',');
b = limit a 100;
dump b;

>>>>
2011-09-11 21:56:34,262 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
2011-09-11 21:56:34,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2011-09-11 21:56:34,483 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2011-09-11 21:56:34,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
>>>>

This ends up launching an MR job with 20,000+ maps and a single reducer.

Is it possible for Pig to analyze such cases and scan only about 100
rows, rather than scanning the entire data set and emitting 100 rows?

This is on Pig 0.9.
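
For anyone hitting the same issue, a possible workaround (just a sketch, not a fix inside Pig itself) is to launch fewer maps by combining input splits; this assumes the split-combination properties available since Pig 0.8:

-- combine small input splits so fewer map tasks are launched
set pig.splitCombination true;
-- aim for roughly 1 GB of input per map; tune to the cluster
set pig.maxCombinedSplitSize 1073741824;
a = load '/usr/largedata/' using PigStorage(',');
b = limit a 100;
dump b;

This does not stop maps early, but it can reduce the number of map tasks substantially.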

-- 
~Rajesh.B

Re: LIMIT optimization

Posted by Daniel Dai <da...@hortonworks.com>.
I'd appreciate it if you could help test. PIG-1270-2.patch should apply
directly to 0.9.0.

Daniel

On Sun, Sep 11, 2011 at 11:25 PM, Rajesh Balamohan <ra...@gmail.com> wrote:
>
> Thanks Daniel for the comments.
>
>
> >> PIG-1270 addresses 2, but performance tests did not show an improvement
>
> This puts a restriction on the PigRecordReader itself and prevents mappers
> from reading more data. Isn't it supposed to improve performance? What
> data size did you use? If this patch is compatible with 0.9, I can try
> it on my cluster.
>
> On Mon, Sep 12, 2011 at 11:14 AM, Daniel Dai <da...@hortonworks.com> wrote:
>
> > [...]
>
> --
> ~Rajesh.B

Re: LIMIT optimization

Posted by Rajesh Balamohan <ra...@gmail.com>.
Thanks Daniel for the comments.


>> PIG-1270 addresses 2, but performance tests did not show an improvement

This puts a restriction on the PigRecordReader itself and prevents mappers
from reading more data. Isn't it supposed to improve performance? What
data size did you use? If this patch is compatible with 0.9, I can try
it on my cluster.

On Mon, Sep 12, 2011 at 11:14 AM, Daniel Dai <da...@hortonworks.com> wrote:

> Two ways to optimize:
> 1. Launch fewer maps
> 2. Stop each map earlier
>
> PIG-1270 addresses 2, but performance tests did not show an improvement.
> For 1, in the extreme case where 2 TB of data contains only 100 records,
> launching all maps is necessary. Pig currently does not probe the input
> data before launching MapReduce jobs. Maybe we could launch fewer maps as
> an initial guess and fall back to launching all maps if the guess fails.
> Thoughts?
>
> Daniel
>
> On Sun, Sep 11, 2011 at 10:13 PM, Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote:
>
> > [...]
>



-- 
~Rajesh.B

Re: LIMIT optimization

Posted by Daniel Dai <da...@hortonworks.com>.
Two ways to optimize:
1. Launch fewer maps
2. Stop each map earlier

PIG-1270 addresses 2, but performance tests did not show an improvement. For
1, in the extreme case where 2 TB of data contains only 100 records,
launching all maps is necessary. Pig currently does not probe the input data
before launching MapReduce jobs. Maybe we could launch fewer maps as an
initial guess and fall back to launching all maps if the guess fails.
Thoughts?
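
In the meantime, a manual approximation of that initial-guess idea (a sketch only; the part-file name below is hypothetical) is to probe a small slice of the input first, and fall back to the full scan only if the probe comes up short:

-- step 1: probe a single part file; often enough to satisfy the limit
a = load '/usr/largedata/part-00000' using PigStorage(',');
b = limit a 100;
dump b;

-- step 2, only if step 1 returned fewer than 100 rows:
-- a = load '/usr/largedata/' using PigStorage(',');
-- b = limit a 100;
-- dump b;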

Daniel

On Sun, Sep 11, 2011 at 10:13 PM, Rajesh Balamohan <rajesh.balamohan@gmail.com> wrote:

> I have a large data set (> 2 TB) and I tried scanning 100 records from it.
>
> a = load '/usr/largedata/' using PigStorage(',');
> b = limit a 100;
> dump b;
>
> >>>>
> 2011-09-11 21:56:34,262 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: LIMIT
> 2011-09-11 21:56:34,414 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
> 2011-09-11 21:56:34,483 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
> 2011-09-11 21:56:34,484 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
> >>>>
>
> This ends up launching an MR job with 20,000+ maps and a single reducer.
>
> Is it possible for Pig to analyze such cases and scan only about 100
> rows, rather than scanning the entire data set and emitting 100 rows?
>
> This is on Pig 0.9.
>
> --
> ~Rajesh.B
>