Posted to user@pig.apache.org by Hui Qi <js...@gmail.com> on 2011/10/12 20:35:41 UTC

Is there a way to set reducer number of pig besides using parallel keyword?

Hi,
I tried to set the reducer count in the following way:
java -Dmapred.reduce.tasks=8 -cp pig.jar:$HADOOP_HOME/conf org.apache.pig.Main ./L1.pig

but it doesn't work: the reducer count stays at 40, which is the PARALLEL
value in L1.pig (L1.pig is from PigMix).
If I delete the "parallel 40" in the script, mapred.reduce.tasks becomes 2,
which I expected to be 1.

L1.pig:
-- This script tests reading from a map, flattening a bag of maps, and use of bincond.
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = foreach A generate user, (int)action as action,
    (map[])page_info as page_info,
    flatten((bag{tuple(map[])})page_links) as page_links;
C = foreach B generate user,
    (action == 1 ? page_info#'a' : page_links#'b') as header;
D = group C by user parallel 40;
E = foreach D generate group, COUNT(C) as cnt;
store E into 'L1out';

Best,
Hui

Re: Is there a way to set reducer number of pig besides using parallel keyword?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Yeah, "group all" is a special case that always has parallelism of 1 (due to
the semantics of grouping by all).
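
For illustration, a minimal sketch (relation and field names hypothetical) of
the case Andrew describes:

-- "group all" produces a single group key, so the reduce phase runs
-- with one reducer no matter what PARALLEL or default_parallel say
A = load 'input' as (user:chararray, clicks:int);
B = group A all;
C = foreach B generate COUNT(A);   -- the aggregate sees every record in one reducer
store C into 'total';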

On Wed, Oct 12, 2011 at 3:47 PM, Andrew Clegg <andrew.clegg+mahout@gmail.com> wrote:

> Something I was wondering the other day... If you do a "group <blah>
> all" and then pass the result to a non-algebraic aggregate function,
> will that guarantee that all the records go to a single reducer? Or is
> it more subtle than that?

Re: Is there a way to set reducer number of pig besides using parallel keyword?

Posted by Andrew Clegg <an...@gmail.com>.
Something I was wondering the other day... If you do a "group <blah>
all" and then pass the result to a non-algebraic aggregate function,
will that guarantee that all the records go to a single reducer? Or is
it more subtle than that?


Re: Is there a way to set reducer number of pig besides using parallel keyword?

Posted by Norbert Burger <no...@gmail.com>.
For a more detailed explanation, take a look also at
http://pig.apache.org/docs/r0.8.1/cookbook.html#Use+the+Parallel+Features.

In summary:

* The PARALLEL keyword at the operator level overrides any other setting
* SET default_parallel determines reducer count for all blocking operators
(ones that force a reduce phase)
* If neither of these are set, then reducer count is determined via a
heuristic based on total input size

Norbert
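
A hypothetical sketch of those three rules in one script (names made up):

set default_parallel 8;              -- default for every blocking operator below
A = load 'input' as (user:chararray, action:int);
B = group A by user;                 -- blocking, no PARALLEL clause: 8 reducers
C = group A by action parallel 16;   -- operator-level PARALLEL wins: 16 reducers
-- if neither were set, Pig would pick the reducer count heuristically
-- from the total input size
store B into 'by_user';
store C into 'by_action';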


Re: Is there a way to set reducer number of pig besides using parallel keyword?

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
set default_parallel 8

-D
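
Applied to the L1.pig from the original mail, that would look something like
this sketch (only the changed lines shown; load and foreach statements unchanged):

set default_parallel 8;              -- at the top of L1.pig
-- ...
D = group C by user;                 -- "parallel 40" removed: the group now
                                     -- falls back to default_parallel (8 reducers)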
