Posted to mapreduce-user@hadoop.apache.org by Vincent Etter <vi...@gmail.com> on 2013/02/14 15:51:26 UTC

Getting many Child Error : Could not reserve enough space for object heap

Dear all,

I am setting up and configuring a small Hadoop cluster of 11 nodes for
teaching purposes. All machines are identical and have the following specs:

   - 4-core Intel(R) Xeon(R) CPU E3-1270 (3.5 GHz)
   - 16 GB of RAM
   - Debian Squeeze

I use Hadoop 0.20.2 as packaged by Cloudera (hadoop-0.20.2-cdh3u5).

The significant configuration options I changed are:

   - mapred.tasktracker.map.tasks.maximum : 4
   - mapred.tasktracker.reduce.tasks.maximum : 2
   - mapred.child.java.opts : -Xmx1500m
   - mapred.child.ulimit : 4500000
   - io.sort.mb : 200
   - io.sort.factor : 64
   - io.file.buffer.size : 65536
   - mapred.jobtracker.taskScheduler : org.apache.hadoop.mapred.FairScheduler
   - mapred.reduce.tasks : 10
   - mapred.reduce.parallel.copies : 10
   - mapred.reduce.slowstart.completed.maps : 0.8

Most of these values were taken from the "Hadoop Operations" book.
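
For reference, this is roughly how the memory-related properties above are
set in my mapred-site.xml (a trimmed sketch; the remaining properties follow
the same pattern, and the io.* ones may belong in core-site.xml instead):

<configuration>
  <!-- slots per TaskTracker -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- per-child JVM heap and process limit -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1500m</value>
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>4500000</value>
  </property>
</configuration>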

My problem is the following: when running jobs on the cluster, I often see
these errors in my mappers:

java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:250)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:237)

Error occurred during initialization of VM
Could not reserve enough space for object heap

At first I had mapred.child.ulimit set to 3000000, then increased it to
4500000, with no change. I don't understand why I get these memory errors:
as I understand it, each node should use at most 1 + 1 + 4*1.5 + 2*1.5 = 11 GB
of RAM, leaving plenty of margin (the first two 1 GB terms being for the
TaskTracker and DataNode processes).
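
Spelled out, the worst case I am budgeting for per worker node is:

   TaskTracker JVM                      1.0 GB
   DataNode JVM                         1.0 GB
   4 map slots    x 1.5 GB heap         6.0 GB
   2 reduce slots x 1.5 GB heap         3.0 GB
   -------------------------------------------
   Total                               11.0 GB  (out of 16 GB of physical RAM)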

Of course, no other software is running on these machines. The JobTracker
and NameNode run on two separate machines that are not part of these 11
workers.

Does anyone have advice on how I could prevent these errors from happening?
All jobs complete fine in the end; it's just that these failures slow things
down a bit and leave me with the impression that I got something wrong.

Are there any issues with my configuration options, given the hardware
specs of my machines?

Thanks in advance for any help or pointers!

Cheers,

Vincent