Posted to user@mahout.apache.org by Frank Wang <wa...@gmail.com> on 2010/12/08 12:02:51 UTC

Naive Bayes testclassifier Java heap space exception

Hi, I was trying out Naive Bayes with a setup similar to the 20 Newsgroups setup.
There are 5 categories, each with 150 articles, and each article is
about 50~150 KB in size.

Training was successful:
$MAHOUT_HOME/bin/mahout trainclassifier   -i news-input   -o news-model
-type bayes   -ng 3   -source hdfs

However, testing the classifier always generates this exception:
$MAHOUT_HOME/bin/mahout testclassifier   -m news-model   -d news-input
-type bayes   -ng 3   -source hdfs   -method mapreduce
http://pastie.org/1358465

I tried giving the map-reduce workers more memory in conf/mapred-site.xml
(tried 256m, 512m, and 1G), but no luck:
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1G</value>
  </property>
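A quick way to confirm the setting is actually reaching the task JVMs (a typo, or a conflicting final property elsewhere, would silently leave the default ~200 MB in place):

```shell
# On the node running the tasks, list the child JVMs spawned by the
# TaskTracker and check that -Xmx1G appears in their command lines.
# (The child main class name here is the one used by Hadoop 0.20.x.)
ps aux | grep '[o]rg.apache.hadoop.mapred.Child'
```

If -Xmx1G is missing from the output, the job is not picking up that mapred-site.xml (e.g. it is reading a different conf directory).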

In 'top', memory usage for the 2 Java processes rises to 1.0 GB and then
TestClassifier crashes.

Are my articles too large?
Has anyone experienced this?

Re: Naive Bayes testclassifier Java heap space exception

Posted by Frank Wang <wa...@gmail.com>.
Hi David,

Thanks for your reply.

I just checked: the docs are 39 MB and the model is 301 MB. I'm running on a
single node in a pseudo-distributed setup. I would think giving the Hadoop
worker 1 GB of memory should be more than enough.
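For what it's worth, a back-of-envelope check. The 4x expansion factor below is purely an assumed ballpark for how much a serialized model can grow when deserialized into Java objects (per-object headers, String overhead, hash-table load factors), not a measured number:

```shell
# Rough estimate: assumed ~4x in-memory expansion of the 301 MB model,
# plus the 39 MB of docs, against a 1024 MB heap.
awk 'BEGIN { model=301; docs=39; heap=1024;
             need = model*4 + docs;
             printf "est. need ~%d MB vs heap %d MB\n", need, heap }'
# → est. need ~1243 MB vs heap 1024 MB
```

Even if the multiplier is off, the point is that a 1 GB heap leaves very little headroom once the in-memory model alone approaches it.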

Am I missing something here?


On Wed, Dec 8, 2010 at 3:24 AM, David Hagar <da...@occamlaw.com> wrote:

> [snip]

Re: Naive Bayes testclassifier Java heap space exception

Posted by David Hagar <da...@occamlaw.com>.
Hi Frank --

One major caveat to the below: I've hacked the 0.4 distribution of
Mahout quite a bit to get Naive Bayes running smoothly on Amazon's S3
and Elastic MapReduce services, so my experience may not be typical
and the memory problems I ran into may well be of my own making.

That said, I had to allocate between 2 and 3 GB per map task to run Naive
Bayes classification. The classification job loads pretty much every
file in the training model into memory, so you can estimate the footprint
from the size of your model directory. It also seemed to me that each map
task was holding on to every document it processed, so each 100-150 KB doc
stays in memory after it has been classified.
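Following that suggestion, one way to check the model's on-disk size (paths taken from the commands earlier in the thread):

```shell
# List the sizes of the model files in HDFS. The classifier loads
# essentially all of this into memory, and the footprint of the
# deserialized Java objects is typically a few times the on-disk size.
hadoop fs -du news-model
```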

I temporarily worked around this by increasing the number of map tasks so
that each task handled fewer documents and therefore kept fewer of them
in memory. Obviously a better fix would be to figure out why they are
being retained in the first place (or whether the steady growth in memory
was coming from something else).
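A sketch of that workaround. Note the assumptions: mapred.map.tasks is only a hint to the InputFormat, and whether the Mahout 0.4 driver forwards -D options to Hadoop depends on the version, so the property may need to go in mapred-site.xml instead:

```shell
# Ask the job for more (hence smaller) map splits, so each task
# classifies fewer documents. This is a hint, not a guarantee.
$MAHOUT_HOME/bin/mahout testclassifier -m news-model -d news-input \
  -type bayes -ng 3 -source hdfs -method mapreduce \
  -Dmapred.map.tasks=20
```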

As I said, the problem may be with my local version of Mahout, but
that was my experience.

-David


On Wed, Dec 8, 2010 at 3:02 AM, Frank Wang <wa...@gmail.com> wrote:
> [snip]