Posted to user@mahout.apache.org by Justin Kay <jk...@easyesi.com> on 2014/02/20 21:15:21 UTC

OutOfMemoryError: Java Heap Space in DocumentProcessor.tokenizeDocuments

Hi everyone,

I've been stuck on an OutOfMemoryError when trying to run a
SparseVectorsFromSequenceFiles job from Java. I'm using Mahout 0.9 and
Hadoop 2.2, run from a Maven project. I've tried setting the heap
configuration through Java using a Hadoop Configuration that is passed to
the job:

CONF.set("mapreduce.map.memory.mb", "1536");
CONF.set("mapreduce.map.java.opts", "-Xmx1024m");
CONF.set("mapreduce.reduce.memory.mb", "1536");
CONF.set("mapreduce.reduce.java.opts", "-Xmx1024m");
CONF.set("task.io.sort.mb", "512");
CONF.set("task.io.sort.factor", "100");

etc., but nothing has seemed to work. My JVM heap settings are similar,
configured as "-Xms512m -Xmx1536m" when running the project. The data I'm
using is 100,000 sequence files totaling ~250 MB. It doesn't fail on a data
set of 63 sequence files (~2 MB). Here is an example stack trace:

Exception in thread "Thread-18" java.lang.OutOfMemoryError: Java heap space
at sun.util.resources.TimeZoneNames.getContents(TimeZoneNames.java:205)
at sun.util.resources.OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:125)
at sun.util.resources.OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:113)
(this seems to get thrown on different bits of code every time)
......
java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)

This is the code I'm running in order to pass in my own Configuration:

SparseVectorsFromSequenceFiles vectorizeJob = new SparseVectorsFromSequenceFiles();
vectorizeJob.setConf(CONF);
ToolRunner.run(vectorizeJob, args); // args is a String[] of command-line options
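
For completeness, the whole driver boils down to roughly this (the class
name, paths, and the exact CONF values are placeholders, not my exact
code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class VectorizeDriver {
    public static void main(String[] cliArgs) throws Exception {
        // Container and JVM heap settings I'm trying to push into the job.
        Configuration conf = new Configuration();
        conf.set("mapreduce.map.memory.mb", "1536");
        conf.set("mapreduce.map.java.opts", "-Xmx1024m");
        conf.set("mapreduce.reduce.memory.mb", "1536");
        conf.set("mapreduce.reduce.java.opts", "-Xmx1024m");

        // Command-line options for the Mahout tool (placeholder paths).
        String[] args = {"-i", "/tmp/seqfiles", "-o", "/tmp/vectors"};

        SparseVectorsFromSequenceFiles vectorizeJob = new SparseVectorsFromSequenceFiles();
        vectorizeJob.setConf(conf);
        ToolRunner.run(vectorizeJob, args);
    }
}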

Any suggestions would be greatly appreciated.

Justin Kay

Re: OutOfMemoryError: Java Heap Space in DocumentProcessor.tokenizeDocuments

Posted by Justin Kay <jk...@easyesi.com>.
Thanks, that seemed to help with passing in the parameters, but I'm still
running into the same problem with the job. It gets stuck at Map 0%
Reduce 0% while tokenizing documents (DocumentProcessor.tokenizeDocuments)
and then throws a "java.lang.OutOfMemoryError: GC overhead limit exceeded"
caused by running out of heap space. (I've tried running it with
the -XX:-UseGCOverheadLimit option and it just gives me the same Java heap
error.)

I've also tried running it with Hadoop 1.2.1 and Mahout 0.8 and had the
same problem.


On Sat, Feb 22, 2014 at 12:22 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> I would pass the memory parameters in the args array directly. The Hadoop
> specific arguments must come before your custom arguments, so like this:
>
> String[] args = new String[]{"-Dmapreduce.map.memory.mb=12323", "customOpt1"};
> ToolRunner.run(vectorizeJob, args);
>
> ToolRunner takes care of putting the Hadoop-specific arguments into the
> job's config. I bet the configuration you pass in is being overridden or
> replaced by something else.
>
> Other than that, there is also
>
> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx2G");
>
>
> which works for me, but this is dependent on the Hadoop version, I guess.
>
>
>
>
> On Thu, Feb 20, 2014 at 9:15 PM, Justin Kay <jk...@easyesi.com> wrote:
>
> > [...]

Re: OutOfMemoryError: Java Heap Space in DocumentProcessor.tokenizeDocuments

Posted by Johannes Schulte <jo...@gmail.com>.
I would pass the memory parameters in the args array directly. The Hadoop
specific arguments must come before your custom arguments, so like this:

String[] args = new String[]{"-Dmapreduce.map.memory.mb=12323", "customOpt1"};
ToolRunner.run(vectorizeJob, args);

ToolRunner takes care of putting the Hadoop-specific arguments into the
job's config. I bet the configuration you pass in is being overridden or
replaced by something else.
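
If it helps, here is a minimal, self-contained sketch showing that the -D
options really do end up in the Tool's Configuration (ConfDemo is a made-up
class, just for illustration):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfDemo extends Configured implements Tool {

    @Override
    public int run(String[] remainingArgs) throws Exception {
        // By the time run() is called, ToolRunner/GenericOptionsParser has
        // already stripped the -D options out of args and set them here.
        System.out.println("mapreduce.map.memory.mb = "
            + getConf().get("mapreduce.map.memory.mb"));
        System.out.println("remaining args: "
            + java.util.Arrays.toString(remainingArgs));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. args = {"-Dmapreduce.map.memory.mb=1536", "customOpt1"}
        System.exit(ToolRunner.run(new ConfDemo(), args));
    }
}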

Other than that, there is also

job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx2G");


which works for me, but this is dependent on the Hadoop version, I guess.
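
As far as I remember, the old and new property names map roughly like this
(again, dependent on the version, so treat it as a sketch):

Configuration conf = new Configuration();

// Hadoop 1.x (MR1) style property names:
conf.set("mapred.map.child.java.opts", "-Xmx1024m");
conf.set("mapred.reduce.child.java.opts", "-Xmx1024m");

// Hadoop 2.x (YARN/MR2) equivalents; the *.memory.mb container sizes
// should be somewhat larger than the corresponding -Xmx values:
conf.set("mapreduce.map.java.opts", "-Xmx1024m");
conf.set("mapreduce.reduce.java.opts", "-Xmx1024m");
conf.set("mapreduce.map.memory.mb", "1536");
conf.set("mapreduce.reduce.memory.mb", "1536");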




On Thu, Feb 20, 2014 at 9:15 PM, Justin Kay <jk...@easyesi.com> wrote:

> [...]