Posted to user@pig.apache.org by Yang <te...@gmail.com> on 2012/01/12 03:12:48 UTC

how to control the number of mappers?

I have a pig script  that does basically a map-only job:

raw = LOAD 'input.txt';

processed = FOREACH raw GENERATE convert_somehow($1,$2...);

STORE processed INTO 'output.txt';



I have many nodes on my cluster, so I want Pig to process the input with
more mappers, but it generates only 2 part-m-xxxxx files, i.e. it is
using only 2 mappers.

For a plain Hadoop job it's possible to pass a mapper count and
-Dmapred.min.split.size= ; would this also work for Pig? The PARALLEL
keyword only applies to reducers.
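
For example, something along these lines for a plain MapReduce driver
(the jar/class names and values are placeholders, and the -D options
only take effect if the driver runs through ToolRunner):

  hadoop jar myjob.jar MyDriver -Dmapred.map.tasks=20 -Dmapred.min.split.size=65536 in/ out/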


Thanks
Yang

Re: how to control the number of mappers?

Posted by Yang <te...@gmail.com>.
OK, I see: I was using Pig 0.5.

I tried 0.9, and it works now.

Thanks!
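
For anyone on an older Pig that rejects these keys in "set", it may also
be possible to pass them when launching Pig (an untested sketch; I
believe the -D arguments need to come before other options):

  pig -Dmapred.min.split.size=10000 -Dmapred.map.tasks.speculative.execution=false a.pg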

On Tue, Jan 17, 2012 at 1:20 PM, Yang <te...@gmail.com> wrote:

> weird
>
> I tried
>
> # head a.pg
>
> set job.name 'blah';
> SET mapred.map.tasks.speculative.execution false;
> set mapred.min.split.size 10000;
>
> set mapred.tasktracker.map.tasks.maximum 10000;
>
>
> [root@]# pig a.pg
> 2012-01-17 16:19:18,407 [main] INFO  org.apache.pig.Main - Logging error
> messages to: /mnt/pig_1326835158407.log
> 2012-01-17 16:19:18,564 [main] INFO
>  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to hadoop file system at: hdfs://
> ec2-107-22-118-169.compute-1.amazonaws.com:8020/
> 2012-01-17 16:19:18,749 [main] INFO
>  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
> Connecting to map-reduce job tracker at:
> ec2-107-22-118-169.compute-1.amazonaws.com:8021
> 2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
> Details at logfile: /mnt/pig_1326835158407.log
>
>
> Pig Stack Trace
> ---------------
> ERROR 1000: Error during parsing. Unrecognized set key:
> mapred.map.tasks.speculative.execution
>
> org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set
> key: mapred.map.tasks.speculative.execution
>         at
> org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
>         at
> org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
>         at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
>         at
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
>         at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
>         at org.apache.pig.Main.main(Main.java:397)
>
> ================================================================================
>
>
> so the job.name param is accepted, but the next one mapred.map...... was
> unrecognized.
> but that is the one I pasted from the docs page
>
>
> On Tue, Jan 17, 2012 at 1:15 PM, Dmitriy Ryaboy <dv...@gmail.com>wrote:
>
>> http://pig.apache.org/docs/r0.9.1/cmds.html#set
>>
>> "All Pig and Hadoop properties can be set, either in the Pig script or via
>> the Grunt command line."
>>
>> On Tue, Jan 17, 2012 at 12:53 PM, Yang <te...@gmail.com> wrote:
>>
>> > Prashant:
>> >
>> > I tried splitting the input files, yes that worked, and multiple mappers
>> > were indeed created.
>> >
>> > but then I would have to create a separate stage simply to split the
>> input
>> > files, so that is a bit cumbersome. it would be nice if there is some
>> > control to directly limit map file input size etc.
>> >
>> > Thanks
>> > Yang
>> >
>> > On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <
>> prash1784@gmail.com
>> > >wrote:
>> >
>> > > By block size I mean the actual HDFS block size. Based on your
>> > requirement
>> > > it seems like the input files are extremely small and reducing the
>> block
>> > > size is not an option.
>> > >
>> > > Specifying "mapred.min.split.size" would not work for both
>> Hadoop/Java MR
>> > > and Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).
>> > >
>> > > Your job is more CPU intensive than I/O. I can think of splitting your
>> > > files into multiple input files (equal to # of map tasks on your
>> cluster)
>> > > and turning off splitCombination (pig.splitCombination=false). Though
>> > this
>> > > is generally a terrible MR practice!
>> > >
>> > > Another thing you could try is to give more memory to your map tasks
>> by
>> > > increasing "mapred.child.java.opts" to a higher value.
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > >
>> > > On Wed, Jan 11, 2012 at 6:27 PM, Yang <te...@gmail.com> wrote:
>> > >
>> > > > Prashant:
>> > > >
>> > > > thanks.
>> > > >
>> > > > by "reducing the block size", do you mean split size ? ---- block
>> size
>> > > > is fixed on a hadoop hdfs.
>> > > >
>> > > > my application is not really data heavy, each line of input takes a
>> > > > long while to process. as a result, the input size is small, but
>> total
>> > > > processing time is long, and the potential parallelism is high
>> > > >
>> > > > Yang
>> > > >
>> > > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
>> > > > <pr...@gmail.com> wrote:
>> > > > > Hi Yang,
>> > > > >
>> > > > > You cannot really control the number of mappers directly (depends
>> on
>> > > > > input splits), but surely can spawn more mappers in various ways,
>> > such
>> > > > > as reducing the block size or setting pig.splitCombination to
>> false
>> > > > > (this *might* create more maps).
>> > > > >
>> > > > > Level of parallelization depends on how much data the 2 mappers
>> are
>> > > > > handling. You would not want a lot of maps handling too little
>> data.
>> > > > > For eg, if your input data set is only a few MB it would not be a
>> > good
>> > > > > idea to have more than 1 or 2 maps.
>> > > > >
>> > > > > Thanks,
>> > > > > Prashant
>> > > > >
>> > > > > Sent from my iPhone
>> > > > >
>> > > > > On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
>> > > > >
>> > > > >> I have a pig script  that does basically a map-only job:
>> > > > >>
>> > > > >> raw = LOAD 'input.txt' ;
>> > > > >>
>> > > > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>> > > > >>
>> > > > >> store processed into 'output.txt';
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> I have many nodes on my cluster, so I want PIG to process the
>> input
>> > in
>> > > > >> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
>> > > > >> using 2 mappers.
>> > > > >>
>> > > > >> in hadoop job it's possible to pass mapper count and
>> > > > >> -Dmapred.min.split.size= ,  would this also work for PIG? the
>> > PARALLEL
>> > > > >> keyword only works for reducers
>> > > > >>
>> > > > >>
>> > > > >> Thanks
>> > > > >> Yang
>> > > >
>> > >
>> >
>>
>
>

Re: how to control the number of mappers?

Posted by Yang <te...@gmail.com>.
Weird.

I tried:

# head a.pg

set job.name 'blah';
SET mapred.map.tasks.speculative.execution false;
set mapred.min.split.size 10000;

set mapred.tasktracker.map.tasks.maximum 10000;


[root@]# pig a.pg
2012-01-17 16:19:18,407 [main] INFO  org.apache.pig.Main - Logging error
messages to: /mnt/pig_1326835158407.log
2012-01-17 16:19:18,564 [main] INFO
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to hadoop file system at: hdfs://
ec2-107-22-118-169.compute-1.amazonaws.com:8020/
2012-01-17 16:19:18,749 [main] INFO
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine -
Connecting to map-reduce job tracker at:
ec2-107-22-118-169.compute-1.amazonaws.com:8021
2012-01-17 16:19:18,858 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1000: Error during parsing. Unrecognized set key:
mapred.map.tasks.speculative.execution
Details at logfile: /mnt/pig_1326835158407.log


Pig Stack Trace
---------------
ERROR 1000: Error during parsing. Unrecognized set key:
mapred.map.tasks.speculative.execution

org.apache.pig.tools.pigscript.parser.ParseException: Unrecognized set key:
mapred.map.tasks.speculative.execution
        at
org.apache.pig.tools.grunt.GruntParser.processSet(GruntParser.java:459)
        at
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:429)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
        at org.apache.pig.Main.main(Main.java:397)
================================================================================


So the job.name param is accepted, but the next one,
mapred.map.tasks.speculative.execution, was unrecognized, even though
that is the one I pasted from the docs page.


On Tue, Jan 17, 2012 at 1:15 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> http://pig.apache.org/docs/r0.9.1/cmds.html#set
>
> "All Pig and Hadoop properties can be set, either in the Pig script or via
> the Grunt command line."
>
> On Tue, Jan 17, 2012 at 12:53 PM, Yang <te...@gmail.com> wrote:
>
> > Prashant:
> >
> > I tried splitting the input files, yes that worked, and multiple mappers
> > were indeed created.
> >
> > but then I would have to create a separate stage simply to split the
> input
> > files, so that is a bit cumbersome. it would be nice if there is some
> > control to directly limit map file input size etc.
> >
> > Thanks
> > Yang
> >
> > On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <
> prash1784@gmail.com
> > >wrote:
> >
> > > By block size I mean the actual HDFS block size. Based on your
> > requirement
> > > it seems like the input files are extremely small and reducing the
> block
> > > size is not an option.
> > >
> > > Specifying "mapred.min.split.size" would not work for both Hadoop/Java
> MR
> > > and Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).
> > >
> > > Your job is more CPU intensive than I/O. I can think of splitting your
> > > files into multiple input files (equal to # of map tasks on your
> cluster)
> > > and turning off splitCombination (pig.splitCombination=false). Though
> > this
> > > is generally a terrible MR practice!
> > >
> > > Another thing you could try is to give more memory to your map tasks by
> > > increasing "mapred.child.java.opts" to a higher value.
> > >
> > > Thanks,
> > > Prashant
> > >
> > >
> > > On Wed, Jan 11, 2012 at 6:27 PM, Yang <te...@gmail.com> wrote:
> > >
> > > > Prashant:
> > > >
> > > > thanks.
> > > >
> > > > by "reducing the block size", do you mean split size ? ---- block
> size
> > > > is fixed on a hadoop hdfs.
> > > >
> > > > my application is not really data heavy, each line of input takes a
> > > > long while to process. as a result, the input size is small, but
> total
> > > > processing time is long, and the potential parallelism is high
> > > >
> > > > Yang
> > > >
> > > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > > > <pr...@gmail.com> wrote:
> > > > > Hi Yang,
> > > > >
> > > > > You cannot really control the number of mappers directly (depends
> on
> > > > > input splits), but surely can spawn more mappers in various ways,
> > such
> > > > > as reducing the block size or setting pig.splitCombination to false
> > > > > (this *might* create more maps).
> > > > >
> > > > > Level of parallelization depends on how much data the 2 mappers are
> > > > > handling. You would not want a lot of maps handling too little
> data.
> > > > > For eg, if your input data set is only a few MB it would not be a
> > good
> > > > > idea to have more than 1 or 2 maps.
> > > > >
> > > > > Thanks,
> > > > > Prashant
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
> > > > >
> > > > >> I have a pig script  that does basically a map-only job:
> > > > >>
> > > > >> raw = LOAD 'input.txt' ;
> > > > >>
> > > > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> > > > >>
> > > > >> store processed into 'output.txt';
> > > > >>
> > > > >>
> > > > >>
> > > > >> I have many nodes on my cluster, so I want PIG to process the
> input
> > in
> > > > >> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> > > > >> using 2 mappers.
> > > > >>
> > > > >> in hadoop job it's possible to pass mapper count and
> > > > >> -Dmapred.min.split.size= ,  would this also work for PIG? the
> > PARALLEL
> > > > >> keyword only works for reducers
> > > > >>
> > > > >>
> > > > >> Thanks
> > > > >> Yang
> > > >
> > >
> >
>

Re: how to control the number of mappers?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
http://pig.apache.org/docs/r0.9.1/cmds.html#set

"All Pig and Hadoop properties can be set, either in the Pig script or via
the Grunt command line."
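
For example, something along these lines should parse with Pig 0.9+
(the values are only illustrative):

  set mapred.min.split.size 1000000;
  set mapred.map.tasks.speculative.execution false;

or, equivalently, from the Grunt shell:

  grunt> set mapred.min.split.size 1000000;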

On Tue, Jan 17, 2012 at 12:53 PM, Yang <te...@gmail.com> wrote:

> Prashant:
>
> I tried splitting the input files, yes that worked, and multiple mappers
> were indeed created.
>
> but then I would have to create a separate stage simply to split the input
> files, so that is a bit cumbersome. it would be nice if there is some
> control to directly limit map file input size etc.
>
> Thanks
> Yang
>
> On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <prash1784@gmail.com
> >wrote:
>
> > By block size I mean the actual HDFS block size. Based on your
> requirement
> > it seems like the input files are extremely small and reducing the block
> > size is not an option.
> >
> > Specifying "mapred.min.split.size" would not work for both Hadoop/Java MR
> > and Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).
> >
> > Your job is more CPU intensive than I/O. I can think of splitting your
> > files into multiple input files (equal to # of map tasks on your cluster)
> > and turning off splitCombination (pig.splitCombination=false). Though
> this
> > is generally a terrible MR practice!
> >
> > Another thing you could try is to give more memory to your map tasks by
> > increasing "mapred.child.java.opts" to a higher value.
> >
> > Thanks,
> > Prashant
> >
> >
> > On Wed, Jan 11, 2012 at 6:27 PM, Yang <te...@gmail.com> wrote:
> >
> > > Prashant:
> > >
> > > thanks.
> > >
> > > by "reducing the block size", do you mean split size ? ---- block size
> > > is fixed on a hadoop hdfs.
> > >
> > > my application is not really data heavy, each line of input takes a
> > > long while to process. as a result, the input size is small, but total
> > > processing time is long, and the potential parallelism is high
> > >
> > > Yang
> > >
> > > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > > <pr...@gmail.com> wrote:
> > > > Hi Yang,
> > > >
> > > > You cannot really control the number of mappers directly (depends on
> > > > input splits), but surely can spawn more mappers in various ways,
> such
> > > > as reducing the block size or setting pig.splitCombination to false
> > > > (this *might* create more maps).
> > > >
> > > > Level of parallelization depends on how much data the 2 mappers are
> > > > handling. You would not want a lot of maps handling too little data.
> > > > For eg, if your input data set is only a few MB it would not be a
> good
> > > > idea to have more than 1 or 2 maps.
> > > >
> > > > Thanks,
> > > > Prashant
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
> > > >
> > > >> I have a pig script  that does basically a map-only job:
> > > >>
> > > >> raw = LOAD 'input.txt' ;
> > > >>
> > > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> > > >>
> > > >> store processed into 'output.txt';
> > > >>
> > > >>
> > > >>
> > > >> I have many nodes on my cluster, so I want PIG to process the input
> in
> > > >> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> > > >> using 2 mappers.
> > > >>
> > > >> in hadoop job it's possible to pass mapper count and
> > > >> -Dmapred.min.split.size= ,  would this also work for PIG? the
> PARALLEL
> > > >> keyword only works for reducers
> > > >>
> > > >>
> > > >> Thanks
> > > >> Yang
> > >
> >
>

Re: how to control the number of mappers?

Posted by Yang <te...@gmail.com>.
Prashant:

I tried splitting the input files; that worked, and multiple mappers
were indeed created.

But then I would have to create a separate stage simply to split the
input files, which is a bit cumbersome. It would be nice if there were
some control to directly limit the map input split size, etc.
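
For example, a rough sketch of such a splitting stage (the file names
and line count are arbitrary):

  split -l 1000 input.txt chunk_
  hadoop fs -mkdir input_chunks
  hadoop fs -put chunk_* input_chunks/

and then LOAD 'input_chunks' instead of 'input.txt', with
pig.splitCombination set to false so the small files are not combined
back into a single split.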

Thanks
Yang

On Wed, Jan 11, 2012 at 7:46 PM, Prashant Kommireddi <pr...@gmail.com>wrote:

> By block size I mean the actual HDFS block size. Based on your requirement
> it seems like the input files are extremely small and reducing the block
> size is not an option.
>
> Specifying "mapred.min.split.size" would not work for both Hadoop/Java MR
> and Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).
>
> Your job is more CPU intensive than I/O. I can think of splitting your
> files into multiple input files (equal to # of map tasks on your cluster)
> and turning off splitCombination (pig.splitCombination=false). Though this
> is generally a terrible MR practice!
>
> Another thing you could try is to give more memory to your map tasks by
> increasing "mapred.child.java.opts" to a higher value.
>
> Thanks,
> Prashant
>
>
> On Wed, Jan 11, 2012 at 6:27 PM, Yang <te...@gmail.com> wrote:
>
> > Prashant:
> >
> > thanks.
> >
> > by "reducing the block size", do you mean split size ? ---- block size
> > is fixed on a hadoop hdfs.
> >
> > my application is not really data heavy, each line of input takes a
> > long while to process. as a result, the input size is small, but total
> > processing time is long, and the potential parallelism is high
> >
> > Yang
> >
> > On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> > <pr...@gmail.com> wrote:
> > > Hi Yang,
> > >
> > > You cannot really control the number of mappers directly (depends on
> > > input splits), but surely can spawn more mappers in various ways, such
> > > as reducing the block size or setting pig.splitCombination to false
> > > (this *might* create more maps).
> > >
> > > Level of parallelization depends on how much data the 2 mappers are
> > > handling. You would not want a lot of maps handling too little data.
> > > For eg, if your input data set is only a few MB it would not be a good
> > > idea to have more than 1 or 2 maps.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > Sent from my iPhone
> > >
> > > On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
> > >
> > >> I have a pig script  that does basically a map-only job:
> > >>
> > >> raw = LOAD 'input.txt' ;
> > >>
> > >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> > >>
> > >> store processed into 'output.txt';
> > >>
> > >>
> > >>
> > >> I have many nodes on my cluster, so I want PIG to process the input in
> > >> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> > >> using 2 mappers.
> > >>
> > >> in hadoop job it's possible to pass mapper count and
> > >> -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
> > >> keyword only works for reducers
> > >>
> > >>
> > >> Thanks
> > >> Yang
> >
>

Re: how to control the number of mappers?

Posted by Prashant Kommireddi <pr...@gmail.com>.
By block size I mean the actual HDFS block size. Based on your requirement
it seems like the input files are extremely small and reducing the block
size is not an option.

Specifying "mapred.min.split.size" would not work for both Hadoop/Java MR
and Pig. Hadoop only picks the maximum of (minSplitSize, blockSize).

Your job is more CPU-intensive than I/O-intensive. I can think of
splitting your files into multiple input files (equal to the # of map
tasks on your cluster) and turning off split combination
(pig.splitCombination=false), though this is generally a terrible MR
practice!

Another thing you could try is giving your map tasks more memory by
raising the heap size in "mapred.child.java.opts".

Thanks,
Prashant


On Wed, Jan 11, 2012 at 6:27 PM, Yang <te...@gmail.com> wrote:

> Prashant:
>
> thanks.
>
> by "reducing the block size", do you mean split size ? ---- block size
> is fixed on a hadoop hdfs.
>
> my application is not really data heavy, each line of input takes a
> long while to process. as a result, the input size is small, but total
> processing time is long, and the potential parallelism is high
>
> Yang
>
> On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
> <pr...@gmail.com> wrote:
> > Hi Yang,
> >
> > You cannot really control the number of mappers directly (depends on
> > input splits), but surely can spawn more mappers in various ways, such
> > as reducing the block size or setting pig.splitCombination to false
> > (this *might* create more maps).
> >
> > Level of parallelization depends on how much data the 2 mappers are
> > handling. You would not want a lot of maps handling too little data.
> > For eg, if your input data set is only a few MB it would not be a good
> > idea to have more than 1 or 2 maps.
> >
> > Thanks,
> > Prashant
> >
> > Sent from my iPhone
> >
> > On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
> >
> >> I have a pig script  that does basically a map-only job:
> >>
> >> raw = LOAD 'input.txt' ;
> >>
> >> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> >>
> >> store processed into 'output.txt';
> >>
> >>
> >>
> >> I have many nodes on my cluster, so I want PIG to process the input in
> >> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> >> using 2 mappers.
> >>
> >> in hadoop job it's possible to pass mapper count and
> >> -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
> >> keyword only works for reducers
> >>
> >>
> >> Thanks
> >> Yang
>

Re: how to control the number of mappers?

Posted by Yang <te...@gmail.com>.
Prashant:

Thanks.

By "reducing the block size", do you mean the split size? The block size
is fixed on an HDFS cluster.

My application is not really data heavy; each line of input takes a long
while to process. As a result, the input size is small, but the total
processing time is long and the potential parallelism is high.

Yang

On Wed, Jan 11, 2012 at 6:21 PM, Prashant Kommireddi
<pr...@gmail.com> wrote:
> Hi Yang,
>
> You cannot really control the number of mappers directly (depends on
> input splits), but surely can spawn more mappers in various ways, such
> as reducing the block size or setting pig.splitCombination to false
> (this *might* create more maps).
>
> Level of parallelization depends on how much data the 2 mappers are
> handling. You would not want a lot of maps handling too little data.
> For eg, if your input data set is only a few MB it would not be a good
> idea to have more than 1 or 2 maps.
>
> Thanks,
> Prashant
>
> Sent from my iPhone
>
> On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:
>
>> I have a pig script  that does basically a map-only job:
>>
>> raw = LOAD 'input.txt' ;
>>
>> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>>
>> store processed into 'output.txt';
>>
>>
>>
>> I have many nodes on my cluster, so I want PIG to process the input in
>> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
>> using 2 mappers.
>>
>> in hadoop job it's possible to pass mapper count and
>> -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
>> keyword only works for reducers
>>
>>
>> Thanks
>> Yang

Re: how to control the number of mappers?

Posted by Prashant Kommireddi <pr...@gmail.com>.
Hi Yang,

You cannot really control the number of mappers directly (it depends on
the input splits), but you can certainly spawn more mappers in various
ways, such as reducing the block size or setting pig.splitCombination to
false (this *might* create more maps).

The level of parallelism depends on how much data those 2 mappers are
handling. You would not want a lot of maps each handling too little
data. For example, if your input data set is only a few MB, it would not
be a good idea to have more than 1 or 2 maps.
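
For example, one rough (untested) way to apply the block-size idea to a
small file is to re-upload it with a smaller block size; the 1 MB value
below is only illustrative and must be a multiple of 512:

  hadoop fs -Ddfs.block.size=1048576 -put input.txt input_small_blocks.txt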

Thanks,
Prashant

Sent from my iPhone

On Jan 11, 2012, at 6:13 PM, Yang <te...@gmail.com> wrote:

> I have a pig script  that does basically a map-only job:
>
> raw = LOAD 'input.txt' ;
>
> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
>
> store processed into 'output.txt';
>
>
>
> I have many nodes on my cluster, so I want PIG to process the input in
> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> using 2 mappers.
>
> in hadoop job it's possible to pass mapper count and
> -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
> keyword only works for reducers
>
>
> Thanks
> Yang

Re: how to control the number of mappers?

Posted by Yang <te...@gmail.com>.
Thanks, but from http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#set
it looks like the set of params that can be 'set' is very limited, and it
does not include the min split size and mapper count that I want.



On Wed, Jan 11, 2012 at 9:52 PM, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> Yes, you can use the "set" keyword to set such properties in the script.
>
> On Jan 11, 2012, at 6:12 PM, Yang <te...@gmail.com> wrote:
>
> > I have a pig script  that does basically a map-only job:
> >
> > raw = LOAD 'input.txt' ;
> >
> > processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> >
> > store processed into 'output.txt';
> >
> >
> >
> > I have many nodes on my cluster, so I want PIG to process the input in
> > more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> > using 2 mappers.
> >
> > in hadoop job it's possible to pass mapper count and
> > -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
> > keyword only works for reducers
> >
> >
> > Thanks
> > Yang
>

Re: how to control the number of mappers?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yes, you can use the "set" keyword to set such properties in the script. 

On Jan 11, 2012, at 6:12 PM, Yang <te...@gmail.com> wrote:

> I have a pig script  that does basically a map-only job:
> 
> raw = LOAD 'input.txt' ;
> 
> processed = FOREACH raw GENERATE convert_somehow($1,$2...);
> 
> store processed into 'output.txt';
> 
> 
> 
> I have many nodes on my cluster, so I want PIG to process the input in
> more mappers. but it generates only 2 part-m-xxxxx  files, i.e.
> using 2 mappers.
> 
> in hadoop job it's possible to pass mapper count and
> -Dmapred.min.split.size= ,  would this also work for PIG? the PARALLEL
> keyword only works for reducers
> 
> 
> Thanks
> Yang