Posted to user@nutch.apache.org by Marek Bachmann <m....@uni-kassel.de> on 2011/12/06 00:37:26 UTC

Problems with HeapSpace in Hadoop Cluster

Hello folks,

excuse me if my topic is half off-topic, since it mostly concerns the
Hadoop setup, but perhaps someone has run into the same problem already.

The situation is as follows:

I am merging the segments together after every crawl cycle so that the
WebGraph can be calculated easily.

At the moment I have three segments, and two of them are very large
(though I think not by Hadoop standards ;-) )

The first segment is 22.9 GB, the second 23.3 GB, and the third
(only) 528 MB.

When I try to merge these three segments, the job crashes after a
while with a couple of heap space errors, see below:

11/12/06 00:20:53 INFO mapred.JobClient: Task Id :
attempt_201112052355_0002_r_000003_0, Status : FAILED
Error: java.lang.OutOfMemoryError: Java heap space
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

My Hadoop cluster consists of a master node with 4 GB RAM and a
dual-core CPU at 2 GHz per core.
There are five identical slaves with 1.5 GB RAM and dual-core CPUs at
3.0 GHz per core.

I set HADOOP_HEAPSIZE to 1500 MB, and in mapred-site.xml:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <description>
      You can specify other Java options for each map or reduce task here,
      but most likely you will want to adjust the heap size.
    </description>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>50</value>
    <description>
      define mapred.map tasks to be number of slave hosts
    </description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
    <description>
      define mapred.reduce tasks to be number of slave hosts
    </description>
  </property>
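Since the OutOfMemoryError above is thrown in shuffleInMemory, the reducer's in-memory shuffle buffer is the usual suspect. A sketch of two adjustments for Hadoop 0.20/1.x (the property names and values here are assumptions to verify against your Hadoop version's mapred-default.xml; the heap value must still fit into the slaves' 1.5 GB alongside the DataNode and TaskTracker daemons):

```xml
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <!-- Fraction of the reducer heap used to buffer map outputs during
       the shuffle; the default is 0.70, which is often too aggressive
       on small heaps. Lowering it spills more to disk instead. -->
  <value>0.20</value>
</property>

<property>
  <name>mapred.reduce.child.java.opts</name>
  <!-- Available in newer 0.2x/1.x releases as a reduce-only override
       of mapred.child.java.opts; lets reducers get a larger heap than
       mappers if RAM allows. -->
  <value>-Xmx768m</value>
</property>
```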

But with these values I can't run the merger without the job failing
because of the heap space errors. Any idea whether I could solve the
problem by adjusting the configuration, or do I just need more RAM for
this job?

Thank you very much!

Marek

Re: Problems with HeapSpace in Hadoop Cluster

Posted by Markus Jelsma <ma...@openindex.io>.
Do your unmerged segments contain a lot of duplicates? If not, then don't 
merge. It is not required anymore and takes a lot of time. Technically there 
is no reason to merge segments, and the WebGraph tool already has a 
-segmentDir option in 1.4.

Otherwise, increase mapper heap space and decrease reducer heap space.
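With the -segmentDir option mentioned above, the WebGraph job in 1.4 can read all segments in one pass, so no merge is needed. A sketch of the invocation (the paths are example placeholders for your own crawl directory):

```shell
# Build the web graph from every segment under crawl/segments at once,
# instead of passing a single pre-merged segment with -segment.
bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb
```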
