Posted to common-user@hadoop.apache.org by Chris Dyer <re...@umd.edu> on 2009/09/15 07:42:45 UTC

Best practices with large-memory jobs

Hello Hadoopers-
I'm attempting to run some large-memory map tasks using hadoop
streaming, but I seem to be running afoul of the mapred.child.ulimit
restriction, which is set to 2097152.  I assume this is in KB since my
tasks fail when they get to about 2GB (I just need to get to about
2.3GB- almost there!).  So far, nothing I've tried has succeeded in
changing this value.   I've attempted to add
-jobconf mapred.child.ulimit=3000000
to the streaming command line, but to no avail.  In the job's xml file
that I find in my logs, it's still got the old value.  And worse, in
my task logs I see the message:
"attempt to override final parameter: mapred.child.ulimit;  Ignoring."
which doesn't exactly inspire confidence that I'm on the right path.
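
For completeness, the full streaming invocation looks roughly like this
(the jar path and the mapper/reducer scripts are just placeholders
standing in for my real ones):

   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
       -input /user/chris/input \
       -output /user/chris/output \
       -mapper ./my_mapper \
       -reducer ./my_reducer \
       -jobconf mapred.child.ulimit=3000000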

I see there's been a fair amount of traffic on Jira about large memory
jobs, but there doesn't seem to be much in the way of examples or
documentation.  Can someone tell me how to run such a job, especially
a streaming job?

Many thanks in advance--
Chris
ps. I'm running a 0.18.3 cluster on Amazon EC2 (I've been using the
Cloudera convenience scripts, but I can abandon this if I need more
control).  The instances have plenty of memory (7.5GB).

Re: Best practices with large-memory jobs

Posted by Steve Loughran <st...@apache.org>.
Chris Dyer wrote:
>>> my task logs I see the message:
>>> "attempt to override final parameter: mapred.child.ulimit;  Ignoring."
>>> which doesn't exactly inspire confidence that I'm on the right path.
>> Chances are the param has been marked final in the task tracker's running
>> config which will prevent you overriding the value with a job specific
>> configuration.
> Do you have any idea how one unmarks such a thing?  Do I just need to
> edit the configuration file for the task tracker?
> 
>> Depending upon how many tasks per node, that may not be enough. Streaming
>> jobs eat a crapton (I'm pretty sure that is an SI unit) of memory.  If you
> Is there any particular reason for the excessive memory use?  I
> realize this is Java, but it's just sloshing data down to my
> processes...
> 

Java 6u14+ lets you run with "compressed pointers"; everyone is still 
playing with that, but it does appear to reduce 64-bit memory use. If 
you were using 32-bit JVMs, stay with them, as even with compressed 
pointers, 64-bit JVMs use more memory per object instance.
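
If you want to try it, it's just a JVM flag, so (assuming your jobs pick 
up their child JVM options from mapred.child.java.opts, which streaming 
lets you set on the command line) something like this should do it; the 
heap size here is only illustrative:

   -jobconf mapred.child.java.opts="-Xmx2500m -XX:+UseCompressedOops"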

> How does one change the number of map slots per
> node?  I'm a hadoop configuration newbie (which is why I was
> originally excited about the Cloudera EC2 scripts...)

From the code in front of my IDE:

    maxMapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    maxReduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

Those are conf values you have to tune.
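
On an 0.18 cluster those normally live in the tasktrackers' 
hadoop-site.xml and need a tasktracker restart to take effect; roughly 
like this (the values just illustrate the shape, pick whatever fits your 
memory budget):

   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>2</value>
   </property>
   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>1</value>
   </property>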

Re: Best practices with large-memory jobs

Posted by Chris Dyer <re...@umd.edu>.
>> my task logs I see the message:
>> "attempt to override final parameter: mapred.child.ulimit;  Ignoring."
>> which doesn't exactly inspire confidence that I'm on the right path.
>
> Chances are the param has been marked final in the task tracker's running
> config which will prevent you overriding the value with a job specific
> configuration.
Do you have any idea how one unmarks such a thing?  Do I just need to
edit the configuration file for the task tracker?

>
> Depending upon how many tasks per node, that may not be enough. Streaming
> jobs eat a crapton (I'm pretty sure that is an SI unit) of memory.  If you
Is there any particular reason for the excessive memory use?  I
realize this is Java, but it's just sloshing data down to my
processes...

> are hitting 2gb+, that means you can probably run 3 tasks max without
> swapping.  [Don't forget to count the size of the task tracker JVM, the
> streaming.jar JVM, etc, and be cognizant of the fact that JVM mem size !=
> Java heap size.]
I'm seeing the failures even when I run a single job.  But, obviously
I don't want to schedule more than 3 jobs on a node since they won't
have enough memory.  How does one change the number of map slots per
node?  I'm a hadoop configuration newbie (which is why I was
originally excited about the Cloudera EC2 scripts...)

-Chris

Re: Best practices with large-memory jobs

Posted by Allen Wittenauer <aw...@linkedin.com>.
On 9/14/09 10:42 PM, "Chris Dyer" <re...@umd.edu> wrote:
> And worse, in
> my task logs I see the message:
> "attempt to override final parameter: mapred.child.ulimit;  Ignoring."
> which doesn't exactly inspire confidence that I'm on the right path.

Chances are the param has been marked final in the task tracker's running
config which will prevent you overriding the value with a job specific
configuration.
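
In other words, the tasktracker's site config (hadoop-site.xml on 0.18) 
most likely contains an entry along these lines, with the value you are 
seeing:

   <property>
     <name>mapred.child.ulimit</name>
     <value>2097152</value>
     <final>true</final>
   </property>

Removing the <final>true</final> (or just raising the value there) on 
every tasktracker and restarting them is what lets a job-level -jobconf 
override get through.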

> ps. I'm running an 18.3 cluster on Amazon EC2 (I've been using the
> Cloudera convenience scripts, but I can abandon this if I need more
> control).  The instances have plenty of memory (7.5GB).

Depending upon how many tasks per node, that may not be enough. Streaming
jobs eat a crapton (I'm pretty sure that is an SI unit) of memory.  If you
are hitting 2gb+, that means you can probably run 3 tasks max without
swapping.  [Don't forget to count the size of the task tracker JVM, the
streaming.jar JVM, etc, and be cognizant of the fact that JVM mem size !=
Java heap size.]
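
Rough arithmetic with your numbers: 3 tasks at ~2.3GB each is already 
~6.9GB, which leaves well under a gigabyte of the 7.5GB for the 
datanode, the task tracker, the streaming wrapper JVMs and the OS, so 
even 3 per node is optimistic.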