Posted to dev@mahout.apache.org by Gangadhar Nittala <np...@gmail.com> on 2010/09/13 00:11:14 UTC

Running Bayes classifier fills up disk space

All,

I am following the details given in the Mahout wiki to run the Bayes
example [ https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
] with the 0.4 trunk code. I had to make a few modifications to the
commands to match the 0.4 snapshot, and everything up to Step 5 worked.
But when I run Step 6 to train the classifier:

  $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
    org.apache.mahout.classifier.bayes.TrainClassifier \
    --gramSize 3 --input wikipediainput10 --output wikipediamodel10 \
    --classifierType bayes --dataSource hdfs

the machine runs out of disk space.

I did not run this against the complete enwiki-latest-pages-articles.xml
but only a part of it - enwiki-latest-pages-articles10.xml
[http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2].
Even with this, HDFS fills up my 50 GB disk. Is this normal? Does
training the classifier consume that much space? Or is this something
that can be controlled via Hadoop settings? I ask because when I
terminated the classifier process, stopped Hadoop (executed
$HADOOP_HOME/bin/stop-all.sh) and checked the disk space, it was back
to what it was before (around 43 GB free).
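
In case it helps to see where the space goes, something like the
following should show the usage while the job runs (the paths are the
input/output directories from my command above, so adjust as needed):

  # overall HDFS capacity and usage across the datanodes
  $HADOOP_HOME/bin/hadoop dfsadmin -report

  # per-path usage of the job's input and output directories
  $HADOOP_HOME/bin/hadoop fs -du wikipediainput10
  $HADOOP_HOME/bin/hadoop fs -du wikipediamodel10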

If the space usage is normal, is there a smaller data set over which I
can run the classifier? I want to see the classifier's output before I
try to understand the code (the intent was also for me to understand
how to run Mahout algorithms and write example code). Should I be
asking this sort of question on the mahout-users list?

Thank you
Gangadhar

Re: Running Bayes classifier fills up disk space

Posted by Robin Anil <ro...@gmail.com>.
On Mon, Sep 13, 2010 at 3:41 AM, Gangadhar Nittala
<np...@gmail.com> wrote:

> All,
>
> I am following the details given in the Mahout wiki to run the Bayes
> example [ https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
> ] with the 0.4 trunk code. I had to make a few modifications to the
> commands to match the 0.4 snapshot, but when I run the Step 6 - to
> train the classifier thus (I was able to get everything till Step 5
> right), $HADOOP_HOME/bin/hadoop jar
> $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job
> org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3
> --input wikipediainput10 --output wikipediamodel10 --classifierType
> bayes --dataSource hdfs, the machine runs out of disk-space.
>
> I did not run this for the complete enwiki-latest-pages-articles.xml
> but only a part of the complete articles -
> enwiki-latest-pages-articles10.xml.
> [
> http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml.bz2
> ].
> Even with this, the HDFS fills up my 50 GB disk. Is this normal ? Does
> the training of the classifier consume so much space ? Or is this
> something that can be controlled via hadoop settings? I ask this
> because, when I terminated the classifier process, stopped hadoop
> (executed $HADOOP_HOME/bin/stop-all.sh) and checked the disk space, it
> was back to what it was (around 43 GB free).
>

Yes, for now the classifier doesn't delete intermediate files. The final
model is much smaller, under 1 GB.
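
Until that is fixed you can reclaim the space yourself once a run
finishes. Roughly (the subdirectory names below are only examples;
list the output directory to see what your run actually wrote):

  # see what the trainer left under the model output directory
  $HADOOP_HOME/bin/hadoop fs -lsr wikipediamodel10

  # then remove the intermediate directories you no longer need,
  # keeping the final model files (directory name here is an example)
  $HADOOP_HOME/bin/hadoop fs -rmr wikipediamodel10/trainer-wordFreq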

>
> If the space usage is normal, is there a smaller set over which I can
> run the classifier ? I want to see the output for the classifier
> before I try to understand the code (also the intent was for me to
> understand how to run Mahout algorithms and write example code).
> Should I be asking these sort of questions on the mahout-users list ?
>

Try using the WikipediaDatasetCreator to select articles from a given
category list. See the code for more details.
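
Roughly, something like this (the option names are from memory, so
check the driver's --help and the wiki page; the categories file and
input directory names are just examples):

  # a small categories file, one category per line
  printf 'united kingdom\nindia\n' > categories.txt

  $HADOOP_HOME/bin/hadoop jar \
    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
    org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
    --input wikipediadump10 --output wikipediainput10 \
    --categories categories.txt

A smaller category list means far fewer articles make it into the
training set, which keeps the intermediate data (and the disk usage)
down.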

>
> Thank you
> Gangadhar
>