Posted to common-user@hadoop.apache.org by "W.P. McNeill" <bi...@gmail.com> on 2011/03/29 20:18:17 UTC

How do I increase mapper granularity?

I'm running a job whose mappers take a long time, which causes problems like
starving out other jobs that want to run on the same cluster. Rewriting the
mapper algorithm is not currently an option, but I still need a way to
increase the number of mappers so that each task is smaller and the work is
split at a finer granularity. What is the best way to do this?

Looking through the O'Reilly book and starting from this Wiki page
<http://wiki.apache.org/hadoop/HowManyMapsAndReduces> I've come up with a
couple of ideas:

   1. Set mapred.map.tasks to the value I want.
   2. Decrease the block size of my input files.
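
For (1), I'm assuming the parameter can be passed as a generic option on the
command line, something like the following (the jar, class, and paths here
are just placeholders, and it only takes effect if the driver goes through
ToolRunner/GenericOptionsParser):

   hadoop jar myjob.jar MyDriver -Dmapred.map.tasks=500 /input /output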

What are the gotchas with these approaches? I know that (1) may not work
because this parameter is just a suggestion. Is there a command line option
that accomplishes (2), or do I have to do a distcp with a non-default block
size? (I think the answer is that I have to do a distcp, but I'm making
sure.)

Are there other approaches? Are there other gotchas that come with trying
to increase mapper granularity? I know this can be more of an art than a
science.

Thanks.

Re: How do I increase mapper granularity?

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Tue, Mar 29, 2011 at 11:48 PM, W.P. McNeill <bi...@gmail.com> wrote:
>   2. Decrease the block size of my input files.
> do I have to do a distcp with a non-default block
> size.  (I think the answer is that I have to do a distcp, but I'm making
> sure.)

A distcp or even a plain "-cp" with a proper -Ddfs.blocksize=<size>
parameter passed along should do the trick.
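
For example, something along these lines (paths are placeholders, the value
is in bytes, and older releases call the property dfs.block.size instead):

  hadoop distcp -Ddfs.blocksize=33554432 /path/to/input /path/to/input-32mb

  hadoop fs -Ddfs.blocksize=33554432 -cp /path/to/input /path/to/input-32mb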

> Are there other approaches?

You can have a look at schedulers that guarantee resources to a
submitted job (the Fair Scheduler or the Capacity Scheduler, for
instance), perhaps?

> Are there other gotchas that come with trying
> to increase mapper granularity.

One thing that comes to my mind is that the more splits (a.k.a. # of
tasks) you have, the more meta info the JobTracker has to hold and
maintain in its memory. Second, your NameNode also needs to hold a
higher amount of bytes in memory for every such finer-grained set of
files, since lowering block sizes leads to a LOT more block info
and replica locations to keep track of.
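
To put rough numbers on it: a 1 TB input stored as 128 MB blocks is about
8,192 blocks (and map tasks); cut the block size to 32 MB and that becomes
roughly 32,768, with the NameNode tracking four times the block and replica
metadata for the same data.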

-- 
Harsh J
http://harshj.com