Posted to common-user@hadoop.apache.org by Jean-Pierre OCALAN <je...@247realmedia.com> on 2008/03/21 15:49:21 UTC

[core-user][reduce seems to run only on one machine]

Hi,

I'm currently working on a project that involves massive log parsing. I
have one master and 6 slaves.
Looking at each slave's logs, I've noticed that the REDUCE operation
runs on only one machine.
So does that mean that reduce runs on just one machine? And if that
is true, how can I specify that I want the reduce to run on the
other machines as well?

Thanks for any help,

Jean-Pierre.




Re: [core-user][reduce seems to run only on one machine]

Posted by Jean-Pierre OCALAN <je...@247realmedia.com>.
Thank you guys for all those good answers, I appreciate it.

Jean-Pierre.

On Mar 21, 2008, at 12:47 PM, Ted Dunning wrote:

> [quoted text trimmed; see Ted Dunning's full reply below]




Re: [core-user][reduce seems to run only on one machine]

Posted by Ted Dunning <td...@veoh.com>.
The default number of reducers is 4.  It is unlikely that a user who doesn't
know about how to set the number of reducers has changed that value.

This phenomenon of apparently having only a single reducer often happens if
you have a very skewed distribution of keys for the reduce phase.

Imagine that you are counting words, but the text is almost entirely made up
of a single word.  If you don't have a combiner, then all instances of that
word will go to a single reducer.  The other reducers will finish instantly
so you may not even notice that they ran.

You have several options:

A) fix your program.  Most of the times that I have seen this, it was due to
a mistake on my part where I was outputting the wrong value as the reduce
key.  I shouldn't admit this, but it is true.

B) fix your program.  Many times, you can use a combiner to do a lot of the
work of the reducer as the data is emitted from the mapper.  This works if
the reducer's job can be done in pieces, by summarizing the pieces so that the
summarization is done by the combiners (very much in parallel) and the
actual reducer only combines the summary (hopefully very quick, even if not
in parallel).
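
Outside Hadoop itself, the combiner idea in (B) can be sketched in plain
Java (the class and method names here are illustrative, not any Hadoop
API): each "mapper" emits raw (word, 1) pairs, a per-mapper combiner
pre-sums them, and the "reducer" only has to merge the small summaries.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of why a combiner helps with skewed keys: pre-aggregating per
// "mapper" means the final "reducer" merges a few small summaries
// instead of every raw (word, 1) pair for the dominant word.
public class CombinerSketch {

    // Combiner: collapse one mapper's raw word stream into partial counts.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new HashMap<>();
        for (String w : words) {
            partial.merge(w, 1, Integer::sum);
        }
        return partial;
    }

    // Reducer: merge the (already small) partial counts from each combiner.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((w, c) -> total.merge(w, c, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        // Two "mappers", heavily skewed toward the word "the".
        Map<String, Integer> p1 = combine(List.of("the", "the", "the", "cat"));
        Map<String, Integer> p2 = combine(List.of("the", "dog"));
        Map<String, Integer> total = reduce(List.of(p1, p2));
        System.out.println(total.get("the")); // 4
        System.out.println(total.get("cat")); // 1
    }
}
```

In the real Hadoop API this corresponds to registering a reducer-like
class as the combiner on the job configuration, which the framework
then runs on the map side before the shuffle.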

C) fix your problem.  Sometimes, your problem as stated is not particularly
appropriate for parallel execution, but there is a nearly equivalent
statement that is just fine.  For example, suppose that your input contains
a large list of keys and numbers and you want to compute the median of these
numbers for each key.  There is no sufficient statistic for computing the
(exact) median so you can't really use a combiner and if one key dominates,
your program will run slowly.

On the other hand, if you restate the problem to compute the *approximate*
median, then you could use a combiner.  For instance, suppose the combiner
finds the median of the values it is given and emits the median and the
count of samples it processed.  Then the reducer can compute the median of
the partial medians, respecting the counts in the process.  This result is
*not* the median and your program will be *wrong*.  But not by much.  If you
can accept the error (and it will probably be quite small), then you win.
If you can't accept this error, then option (D) is for you.
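
The approximate-median scheme in (C) can be sketched in plain Java
(names are illustrative, not any Hadoop API): each combiner emits the
exact median of its chunk plus the sample count, and the reducer takes a
weighted median of those partial medians.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the approximate median: combiners summarize chunks as
// (median, count) pairs; the reducer picks the weighted median of the
// partial medians. The answer is approximate, not the exact median.
public class ApproxMedianSketch {

    // A partial summary emitted by one "combiner".
    static final class Partial {
        final double median;
        final long count;
        Partial(double median, long count) {
            this.median = median;
            this.count = count;
        }
    }

    // Combiner: exact median of one chunk, plus how many samples it saw.
    static Partial combine(List<Double> chunk) {
        List<Double> sorted = new ArrayList<>(chunk);
        Collections.sort(sorted);
        int n = sorted.size();
        double m = (n % 2 == 1)
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
        return new Partial(m, n);
    }

    // Reducer: weighted median of the partial medians, respecting counts.
    static double reduce(List<Partial> partials) {
        List<Partial> sorted = new ArrayList<>(partials);
        sorted.sort((a, b) -> Double.compare(a.median, b.median));
        long total = 0;
        for (Partial p : sorted) {
            total += p.count;
        }
        long seen = 0;
        for (Partial p : sorted) {
            seen += p.count;
            if (seen * 2 >= total) {
                return p.median;
            }
        }
        throw new IllegalStateException("no partials");
    }

    public static void main(String[] args) {
        Partial p1 = combine(List.of(1.0, 2.0, 3.0));  // median 2.0, count 3
        Partial p2 = combine(List.of(10.0, 11.0));     // median 10.5, count 2
        // Exact median of {1, 2, 3, 10, 11} is 3.0; the sketch returns 2.0.
        System.out.println(reduce(List.of(p1, p2))); // 2.0
    }
}
```

Note how the example in main() shows exactly the "wrong, but not by
much" behavior: the exact median is 3.0 while the two-level scheme
returns 2.0.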

D) give up.  Your problem may not be appropriate for map-reduce execution or
you may be unable to afford the effort required to restate your problem in a
way that will work.  This may not be a bad option ... your program may not
even need parallelism.  Be sure to not overstate the situation.  Don't say
"my problem is impossible to parallelize" since some clever Jack is likely
to come along 10 minutes later and make you look the fool.  Instead say "My
problem appeared to be difficult to parallelize so I punted".

 

On 3/21/08 8:05 AM, "Amar Kamat" <am...@yahoo-inc.com> wrote:

> On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote:
> 
>> Hi,
>> 
>> I'm currently working on a project that involves massive log parsing. I have
>> one master and 6 slaves.
>> Looking at each slave's logs, I've noticed that the REDUCE operation runs on
>> only one machine.
>> So does that mean that reduce runs on just one machine? And if that is true,
>> how can I specify that I want the reduce to run on the other machines as well?
> You can set it in your job using JobConf.setNumReduceTasks(numReducers).
> How are you submitting your job? Is it from examples or your own code?
> Amar
>> 
>> Thanks for any help,
>> 
>> Jean-Pierre.
>> 
>> 
>> 


Re: [core-user][reduce seems to run only on one machine]

Posted by Amar Kamat <am...@yahoo-inc.com>.
On Fri, 21 Mar 2008, Jean-Pierre OCALAN wrote:

> Hi,
>
> I'm currently working on a project that involves massive log parsing. I have
> one master and 6 slaves.
> Looking at each slave's logs, I've noticed that the REDUCE operation runs on
> only one machine.
> So does that mean that reduce runs on just one machine? And if that is true,
> how can I specify that I want the reduce to run on the other machines as well?
You can set it in your job using JobConf.setNumReduceTasks(numReducers).
How are you submitting your job? Is it from examples or your own code?
Amar
>
> Thanks for any help,
>
> Jean-Pierre.
>
>
>
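
Amar's suggestion, in the old `org.apache.hadoop.mapred` API that this
2008 thread is about, looks roughly like the fragment below. This is a
configuration sketch, not a complete job: `MyJob`, `MyMapper` and
`MyReducer` are placeholder names, while `setNumReduceTasks` and
`JobClient.runJob` are the actual JobConf-era calls.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Configuration sketch for spreading reduces across the cluster.
JobConf conf = new JobConf(MyJob.class);
conf.setJobName("log-parse");
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
// With 6 slaves, ask for roughly one reduce task per slave
// (this is a request; it can also be set via mapred.reduce.tasks):
conf.setNumReduceTasks(6);
JobClient.runJob(conf);
```

The same number can be set cluster-wide with the `mapred.reduce.tasks`
property in the job's configuration file, but setting it per job as
above is usually clearer.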