Posted to user@pig.apache.org by ts...@o2.pl on 2013/03/15 10:58:59 UTC

How to control the number of reducers in Apache Pig

Dear Apache Pig Users,

It is easy to control the number of reducers in JOIN, GROUP, COGROUP,
etc. statements, either globally with the "set default_parallel $NUM"
command or per statement with a trailing "PARALLEL $NUM" clause.

However, I am interested in controlling the number of reducers for a
FOREACH statement.
The case is as follows:
* running on CDH 4.0.1 with Pig 0.9.2,
* read one sequence file (of many equivalent files) of about 400GB,
* process each element in a UDF __using as many reducers as possible__,
* store the results.

The Apache Pig script implementing this case -- which gives __only
one__ reducer -- is below:
------------------------------------------------
SET default_parallel 16;
REGISTER myjar.jar;
input_pairs = LOAD '$input' USING
pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
'org.apache.hadoop.io.BytesWritable') as (key:chararray,
value:bytearray);
input_protos  = FOREACH input_pairs GENERATE
FLATTEN(pl.example.ReadProtobuf(value));
output_protos = FOREACH input_protos GENERATE
FLATTEN(pl.example.XMLGenerator(*));
STORE output_protos INTO '$output' USING PigStorage();
------------------------------------------------

As far as I know, "set mapred.reduce.tasks 5" can only limit the
maximum number of reducers.

Could you give me some advice? Am I missing something?

Re: How to control the number of reducers in Apache Pig

Posted by Jonathan Coveney <jc...@gmail.com>.
The script you posted wouldn't have any reducers, so the setting
wouldn't matter: it's a map-only job. LOAD, FOREACH, and STORE all run
map-side, and without a blocking operator (GROUP, JOIN, ORDER,
DISTINCT, ...) Pig compiles the whole script into a single map-only
MapReduce job; PARALLEL and default_parallel only affect reduce
phases, which this job doesn't have.
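
If you really need the UDFs to run in reducers, you can force a
shuffle by grouping first, e.g. on your key (a rough sketch based on
your script; note the GROUP shuffles the full 400GB, and skewed keys
will skew the reducers):
------------------------------------------------
REGISTER myjar.jar;
input_pairs   = LOAD '$input' USING
pl.example.MySequenceFileLoader('org.apache.hadoop.io.BytesWritable',
'org.apache.hadoop.io.BytesWritable') as (key:chararray,
value:bytearray);
-- the GROUP forces a reduce stage; everything after it runs reduce-side
grouped       = GROUP input_pairs BY key PARALLEL 16;
flat_pairs    = FOREACH grouped GENERATE FLATTEN(input_pairs);
input_protos  = FOREACH flat_pairs GENERATE
FLATTEN(pl.example.ReadProtobuf(value));
output_protos = FOREACH input_protos GENERATE
FLATTEN(pl.example.XMLGenerator(*));
STORE output_protos INTO '$output' USING PigStorage();
------------------------------------------------
Otherwise, for a map-only job the parallelism comes from the number of
input splits of the 400GB file, not from any reducer setting.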

