Posted to user@pig.apache.org by paradisehit <pa...@163.com> on 2008/09/26 06:28:23 UTC
Only One reducer can get the total log num
I use a script like this:
querys = GROUP clear_log ALL PARALLEL 4;
TOTAL = FOREACH querys GENERATE FLATTEN(clear_log.($1, $2)), COUNT($1);
STORE TOTAL INTO 'total';
On the Hadoop JobTracker monitoring page I see that only one reducer processes the data, while the other 3 reducers each process 0 MB of data.
I think this should be changed, but how can I change it?
Help me!!
RE: Only One reducer can get the total log num
Posted by Olga Natkovich <ol...@yahoo-inc.com>.
You can see whether the combiner is invoked or not by running
explain TOTAL;
In any case, to produce a single result you need one combiner. We are
reworking the way the combiner is invoked in the types branch. You can try that.
Olga
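Why three of the four reducers sit idle can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Pig's or Hadoop's actual code: the `partition` function is a toy stand-in for a hash partitioner, and the key names ("all", "user0", ...) are hypothetical. The idea it shows is that records are routed to reducers by a hash of the group key, so a GROUP ... ALL, which has exactly one key, can only ever feed one reducer regardless of the PARALLEL setting.

```python
NUM_REDUCERS = 4  # mirrors PARALLEL 4 in the script


def partition(key: str, num_reducers: int) -> int:
    """Toy stand-in for a hash partitioner: the same key always maps to the same reducer."""
    # A deterministic byte sum keeps the example reproducible;
    # a real partitioner would use the key's own hash function.
    return sum(key.encode("utf-8")) % num_reducers


def records_per_reducer(keys, num_reducers=NUM_REDUCERS):
    """Count how many records each reducer index would receive."""
    load = [0] * num_reducers
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load


# GROUP ... ALL puts every record under one and the same key (hypothetical name "all").
group_all = records_per_reducer(["all"] * 1000)

# Grouping by a field with many distinct values spreads records across reducers.
group_by_field = records_per_reducer([f"user{i}" for i in range(1000)])

print("GROUP ALL:     ", group_all)
print("GROUP BY field:", group_by_field)
```

In this sketch one entry of `group_all` holds all 1000 records and the others stay at zero, while `group_by_field` has work in every slot. Requesting more reducers does not change the first case, because there is still only one key to hash.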
Re: Only One reducer can get the total log num
Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Does the combiner get invoked in this specific case?
I thought it did not fit the pattern mentioned in [1] for invoking
combiners. Assuming I am not wrong, if there are newer patterns where the
combiner is invoked, it would be great if they were documented somewhere
(preferably in the bug or on some wiki page).
Thanks,
Mridul
[1] http://issues.apache.org/jira/browse/PIG-7
RE: Only One reducer can get the total log num
Posted by Olga Natkovich <ol...@yahoo-inc.com>.
This is fine. The combiner is used to preaggregate the data on the map side,
and that is done in parallel. The final result has to be computed by a
single reducer since you do want to get a single value in your output.
Olga
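The split of work described above can be sketched as follows. This is a minimal illustration, assuming a count that can be computed from partial counts; the function names and the three input splits are hypothetical and do not come from Pig. Each map task counts its own split in parallel (the combiner step), and a single reducer then adds up those few partial counts.

```python
def map_side_partial_counts(splits):
    """Combiner step: each map task counts the records in its own split independently."""
    return [len(split) for split in splits]


def final_reduce(partial_counts):
    """Reducer step: a single task sums the partial counts into one total."""
    return sum(partial_counts)


# Three hypothetical input splits of log records, one per map task.
splits = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

partials = map_side_partial_counts(splits)
total = final_reduce(partials)

print("partial counts:", partials)
print("total:", total)
```

When pre-aggregation applies, the lone reducer only has to add one small number per map task rather than touch every record, which is why a single busy reducer need not be a bottleneck. Whether the combiner actually applies to a given script is a separate question that explain can help answer.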