Posted to mapreduce-user@hadoop.apache.org by Peter Wolf <op...@gmail.com> on 2011/06/28 22:43:39 UTC

How does Hadoop manage memory?

Hello all,

I am looking for the right thing to read...

I am writing a MapReduce Speech Recognition application.  I want to run 
many Speech Recognizers in parallel.

Speech Recognizers not only use a large amount of processor time, they also 
use a large amount of memory.  Also, in my application, they are often 
idle much of the time waiting for data, so optimizing what runs when is 
non-trivial.

I am trying to better understand how Hadoop manages resources.  Does it 
automatically figure out the right number of mappers to instantiate?  
How?  What happens when other people are sharing the cluster?  What 
resource management is the responsibility of application developers?

For example, let's say each Speech Recognizer uses 500 MB, and I have 
1,000,000 files to process.  What would happen if I made 1,000,000 
mappers, each with 1 Speech Recognizer?  Is it only non-optimal because 
of setup time, or would the system try to allocate 500 TB of memory and 
explode?

Thank you in advance
Peter


Re: How does Hadoop manage memory?

Posted by Allen Wittenauer <aw...@apache.org>.
On Jun 28, 2011, at 1:43 PM, Peter Wolf wrote:

> Hello all,
> 
> I am looking for the right thing to read...
> 
> I am writing a MapReduce Speech Recognition application.  I want to run many Speech Recognizers in parallel.
> 
> Speech Recognizers not only use a large amount of processor time, they also use a large amount of memory.  Also, in my application, they are often idle much of the time waiting for data, so optimizing what runs when is non-trivial.
> 
> I am trying to better understand how Hadoop manages resources.  Does it automatically figure out the right number of mappers to instantiate?

	The number of mappers correlates to the number of InputSplits, which is based upon the InputFormat.  In most cases, this is equivalent to the number of blocks.  So if a file is composed of 3 blocks, it will generate 3 mappers.  Again, depending upon the InputFormat, the size of these splits may be manipulated via job settings.
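
	For example, here is a minimal sketch of bounding split sizes per job.  It assumes the "new" MapReduce API (org.apache.hadoop.mapreduce) and a FileInputFormat-based job; exact class and property names can differ between Hadoop versions, so treat it as illustrative rather than authoritative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "speech-recognition");

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // Ask for larger splits (roughly 512 MB to 1 GB here) so fewer
        // mappers are created; the InputFormat still decides the final
        // split boundaries, typically along HDFS block boundaries.
        FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);

        // ... set mapper/reducer classes, output path, etc., then submit.
    }
}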


>  How?  What happens when other people are sharing the cluster?  What resource management is the responsibility of application developers?

	Realistically, *all* resource management is the responsibility of the operations and development teams.  The only real resource protection/allocation system that Hadoop provides is task slots and, if enabled, some memory protection in the form of "don't go over this much".  On multi-tenant systems, a good-neighbor view of the world should be adopted.
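
	As a rough sketch of the knobs involved (property names shift between Hadoop versions, so these are illustrative, not authoritative):

import org.apache.hadoop.conf.Configuration;

public class TaskMemorySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Per job: the heap handed to each map/reduce task JVM,
        // i.e. the "don't go over this much" limit for your own code.
        conf.set("mapred.child.java.opts", "-Xmx512m");

        // Per job, if the cluster enforces memory limits: the memory
        // this job declares it needs for each map task.
        conf.set("mapred.job.map.memory.mb", "512");

        // Cluster side (mapred-site.xml on each TaskTracker, not per job):
        //   mapred.tasktracker.map.tasks.maximum    -> map slots per node
        //   mapred.tasktracker.reduce.tasks.maximum -> reduce slots per node
    }
}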

> For example, let's say each Speech Recognizer uses 500 MB, and I have 1,000,000 files to process.  What would happen if I made 1,000,000 mappers, each with 1 Speech Recognizer?  

	At 1m mappers, the JobTracker would likely explode under the weight first unless the heap size was raised significantly.  Every value that you see on the JT page, including those for each task, is kept in main memory.
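
	A back-of-envelope illustration (the per-task figure is an assumption, not a measured number): if the JobTracker keeps on the order of 1-2 KB of bookkeeping per task attempt, then 1,000,000 map tasks need roughly 1,000,000 x 2 KB, or about 2 GB, of JobTracker heap for task metadata alone, which is already well past the heap size many JobTrackers run with by default.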


> Is it only non-optimal because of setup time, or would the system try to allocate 500 TB of memory and explode?

	If you had 1,000,000 map slots, yes, it would allocate roughly 500 TB of memory (1,000,000 x 500 MB) spread across the nodes.
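
	To put numbers on that (the cluster size and slot count below are hypothetical): only the mappers that actually hold a slot consume memory at any instant.  On, say, 100 nodes with 10 map slots each, at most 100 x 10 = 1,000 recognizers run at once, so the concurrent demand is 1,000 x 500 MB = 500 GB cluster-wide, or about 5 GB per node, while the remaining tasks wait in the queue.  The full 500 TB would only ever be resident if something close to 1,000,000 slots existed.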