Posted to user@pig.apache.org by paradisehit <pa...@163.com> on 2008/09/26 06:28:23 UTC

Only One reducer can get the total log num

 
 
I use a script like this:
querys = GROUP clear_log  ALL PARALLEL 4;
TOTAL = FOREACH querys GENERATE FLATTEN(clear_log.($1, $2)), COUNT($1);

STORE TOTAL INTO 'total';

On the Hadoop JobTracker monitoring page, I see that only one reducer processes the data, while the other 3 reducers process 0 MB of data.

I think this should be changed, but how can I change it? 

Help me!!

RE: Only One reducer can get the total log num

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
You can see whether the combiner is invoked by running

explain TOTAL;

In any case, to produce a single result you need one combiner. We are
reworking the way the combiner is invoked in the types branch. You can try that.


Olga


Re: Only One reducer can get the total log num

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Does the combiner get invoked in this specific case?
I thought it did not fit the pattern mentioned in [1] for invoking
combiners. Assuming I am not wrong, if there are newer patterns where the
combiner is invoked, it would be great if they were documented somewhere
(preferably in the bug or on a wiki page).


Thanks,
Mridul

[1] http://issues.apache.org/jira/browse/PIG-7



RE: Only One reducer can get the total log num

Posted by Olga Natkovich <ol...@yahoo-inc.com>.
This is fine. The combiner is used to preaggregate the data on the map side,
and that is done in parallel. The final result has to be computed by a
single reducer, since you want a single value in your output.

Olga 
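
To make the point above concrete, here is a minimal Pig Latin sketch, not taken from the thread. It contrasts a global count (one key, so one reducer does the final step) with a per-key count (many keys, so PARALLEL can spread the work). The LOAD path, the field names (ts, query, user) and the output path 'per_query' are assumptions for illustration only. The original script also flattens the bag next to COUNT, which, as raised elsewhere in this thread, may not fit the pattern that lets the combiner run; running explain on the alias is the way to check.

```pig
-- Hypothetical input path and field names; the thread does not show
-- how clear_log is actually loaded.
clear_log = LOAD 'clear_log_input' AS (ts, query, user);

-- Global count: GROUP ALL places every record under a single key, so a
-- single reducer produces the final value. PARALLEL is left out here
-- because additional reducers would receive no data.
all_logs = GROUP clear_log ALL;
total = FOREACH all_logs GENERATE COUNT(clear_log);
STORE total INTO 'total';

-- Per-key count: grouping on a field yields many keys, so PARALLEL 4
-- can distribute the groups across four reducers.
by_query = GROUP clear_log BY query PARALLEL 4;
per_query = FOREACH by_query GENERATE group, COUNT(clear_log);
STORE per_query INTO 'per_query';
```

If the total is needed alongside every record, one option is to compute it once as above and attach it in a later step, rather than flattening the whole bag in the same FOREACH; whether that helps depends on the Pig version in use.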
