Posted to user@pig.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2010/01/24 11:45:33 UTC

enforcing number of mappers

I want to use Pig to parallelize processing of a number of requests. There
are ~300 requests that need to be processed. Each one consists of the
following steps:
1. Fetch a file from S3 to local disk
2. Do some preprocessing
3. Put it into HDFS

My input is a small file with 300 lines. The problem is that Pig always
seems to create a single mapper, so the load is not properly distributed.
Is there any way to force splitting of small input files as well? Below is
the Pig output, which seems to indicate that there is only one mapper. Let
me know if my understanding is wrong.

2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2010-01-24 05:31:53,148 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2010-01-24 05:31:55,006 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job

Thanks
-Prasen.

Re: enforcing number of mappers

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
I am thinking of writing a two-line Pig script to do the job:
r1 = LOAD '/foo/*' USING PigStorage(' ') split by 'file';
r2 = STREAM r1 THROUGH `myscript`;

I am thinking of using the "split by 'file'" clause: basically, split the
single input file into many files (using unix split), then write a simple
script that does the S3 fetch and HDFS put, and invoke it with the STREAM
operator.

Will that distribute the load? Is there any way (debug, logs, etc.) to tell
how many nodes were used as mappers?

-Prasen
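
For illustration, here is a rough sketch of the kind of work the streamed
`myscript` would do for each input line, written as a small Java program
against the Hadoop FileSystem API (this is an assumption about the script,
not code from this thread; it presumes an s3n:// filesystem is configured
with credentials, and the /data target directory is hypothetical):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Reads one S3 location per line from stdin (as Pig's STREAM operator feeds it),
// copies each object into HDFS, and echoes a status line so STREAM has output.
public class S3ToHdfsStream {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            Path src = new Path(line.trim());                // e.g. s3n://bucket/key
            Path dst = new Path("/data/" + src.getName());   // hypothetical HDFS target dir
            // Preprocessing is omitted here; FileUtil.copy streams the object
            // straight from the source filesystem into HDFS.
            FileUtil.copy(src.getFileSystem(conf), src,
                          dst.getFileSystem(conf), dst, false, conf);
            System.out.println(line + "\tOK");
        }
    }
}

Note that this only helps if Pig actually spreads the 300 input lines across
more than one map task, which is exactly the problem being discussed.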

On Sun, Jan 24, 2010 at 5:41 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> you need to write a custom slicer that will enforce your preferred
> strategy for determining # of mappers.
>
> Once the load/store redesign goes in, slicers will go away, and you
> will write custom hadoop partitioners instead.
> -D

Re: enforcing number of mappers

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
you need to write a custom slicer that will enforce your preferred
strategy for determining # of mappers.

Once the load/store redesign goes in, slicers will go away, and you
will write custom hadoop partitioners instead.
-D

Re: enforcing number of mappers

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
Mridul,
  Thanks for the trick, it definitely works. BUT I am seeing some
weirdness when I run this snippet:

> input_lines = load 'my_s3_list_file' as (location_line:chararray);
> grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
> actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);

If I set NUM_MAPPERS_REQUIRED = 10, I only see half of that number
being used. That is, only 5 (or NUM_MAPPERS_REQUIRED/2) of them receive
non-empty input; the other half don't receive any input, so the load is
only distributed to half of the nodes. Any insight as to why this is
happening? The only reason I can think of is that the hash values of
the location_lines collide, which should be rare, yet it happens quite
repeatedly.

-Prasen
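
For context: PARALLEL controls the number of reduce tasks, and Hadoop's
default HashPartitioner decides which reduce task each group key lands on.
A rough Java illustration of that assignment (the real partitioner hashes
the serialized key, but the effect is the same: keys whose hashes map to
the same bucket share a reducer, and buckets that no key maps to stay empty):

import java.util.HashSet;
import java.util.Set;

// Mimics Hadoop's default HashPartitioner: partition = (hash & Integer.MAX_VALUE) % numReducers.
// With few distinct keys, or unlucky hash values, several keys can land in the
// same partition while other reducers receive no input at all.
public class PartitionDemo {
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] keys = { "s3n://bucket/a", "s3n://bucket/b", "s3n://bucket/c",
                          "s3n://bucket/d", "s3n://bucket/e" };
        Set<Integer> used = new HashSet<Integer>();
        for (String k : keys) {
            int p = partition(k, 10);
            used.add(p);
            System.out.println(k + " -> reducer " + p);
        }
        System.out.println(used.size() + " of 10 reducers would get any input");
    }
}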

Re: enforcing number of mappers

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
If each line from your file has to be processed by a different mapper, then
other than writing a custom slicer, a very dirty hack would be to:
a) Create N files with one line each.
b) Or, do something like:
input_lines = load 'my_s3_list_file' as (location_line:chararray);
grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);


The preferred way, as Dmitriy mentioned, would of course be to use a custom
Slicer!

Regards,
Mridul
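
For illustration, a minimal sketch of what a MY_S3_UDF along these lines
might look like (an assumption, not code from this thread: it extends Pig's
EvalFunc, treats the group key as an S3 location, and copies the object into
HDFS through the Hadoop FileSystem API; the s3n:// scheme and the /data
target directory are hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Receives the group key (one distinct S3 location per group), copies that
// object into HDFS, and returns a status string. Because each distinct
// location forms its own group, the copies are spread across the PARALLEL
// reduce tasks.
public class MyS3Udf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String location = input.get(0).toString();         // e.g. s3n://bucket/key
        Configuration conf = new Configuration();
        Path src = new Path(location);
        Path dst = new Path("/data/" + src.getName());     // hypothetical HDFS target dir
        // Preprocessing is omitted; the copy streams the object from S3 into HDFS.
        FileUtil.copy(src.getFileSystem(conf), src,
                      dst.getFileSystem(conf), dst, false, conf);
        return location + "\tdone";
    }
}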
