Posted to user@mahout.apache.org by Nora Olsen <no...@gmail.com> on 2012/08/22 09:45:08 UTC

Using CNaiveBayes in mahout 0.7 without command line

Hi,

I can train and test the model from the command line:

./bin/mahout seqdirectory -i /tmp/articles -o /tmp/articles-seq

./bin/mahout seq2sparse -i /tmp/articles-seq -o /tmp/articles-vectors \
    -lnorm -nv -wt tfidf

./bin/mahout split -i /tmp/articles-vectors/tfidf-vectors \
    --trainingOutput /tmp/articles-train-vectors \
    --testOutput /tmp/articles-test-vectors --randomSelectionPct 40 \
    --overwrite --sequenceFiles -xm sequential

./bin/mahout trainnb -i /tmp/articles-train-vectors -el \
    -o /tmp/model -li /tmp/labelindex -ow -c

./bin/mahout testnb -i /tmp/articles-train-vectors -m /tmp/model \
    -l /tmp/labelindex -ow -o /tmp/articles-testing -c

./bin/mahout testnb -i /tmp/articles-test-vectors -m /tmp/model \
    -l /tmp/labelindex -ow -o /tmp/articles-testing -c


However, I need to call "classifier.classifyFull()" directly from my
program instead of going through the command line.

Looking at the source code of SparseVectorsFromSequenceFiles, I am
unsure how to convert a text document into a vector that the
classifier can use.

Here is the code I have so far. The results are bad: more than 80% of
documents are wrongly classified, which is much worse than what I get
when running via the command line.

LuceneTextValueEncoder enc = new LuceneTextValueEncoder("text");
Analyzer analyzer = ClassUtils.instantiateAs(analyzerClass, Analyzer.class);
enc.setAnalyzer(analyzer);
Vector features = new RandomAccessSparseVector(10000);
enc.addToVector(text, features);
Vector classifierResults = classifier.classifyFull(features);
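
One possible explanation for the gap, stated as an assumption rather than something confirmed in this message: seq2sparse assigns each term a feature index from a dictionary built over the corpus and weights it by TF-IDF, and the model is trained against those indices, whereas a hashing encoder such as LuceneTextValueEncoder places terms at hash-derived positions, so the two feature spaces would not line up. The plain-Java sketch below does not call any Mahout API. It only illustrates dictionary-indexed TF-IDF vectorization; the class name, the vectorize helper, the two maps, and the textbook tf * log(N / df) weighting are all hypothetical stand-ins, and the -lnorm normalization is not reproduced.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: no Mahout classes are used, and every
// name in this file is hypothetical.
final class DictionaryTfIdfSketch {

    // Builds a sparse TF-IDF vector keyed by dictionary index.
    //   dictionary: term -> feature index (stand-in for the dictionary
    //               produced at vectorization time)
    //   docFreq:    feature index -> number of training documents
    //               that contain the term
    //   numDocs:    total number of training documents
    public static Map<Integer, Double> vectorize(List<String> tokens,
                                                 Map<String, Integer> dictionary,
                                                 Map<Integer, Integer> docFreq,
                                                 int numDocs) {
        // Count term frequencies, keyed by the dictionary index.
        Map<Integer, Integer> termFreq = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) { // terms unseen at training time are dropped
                termFreq.merge(index, 1, Integer::sum);
            }
        }
        // Weight each count. This is the textbook tf * log(N / df) form,
        // assumed here for illustration; the real weighting may differ.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : termFreq.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            double weight = e.getValue() * Math.log((double) numDocs / df);
            vector.put(e.getKey(), weight);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("mahout", 0);
        dictionary.put("bayes", 1);
        Map<Integer, Integer> docFreq = new HashMap<>();
        docFreq.put(0, 1);
        docFreq.put(1, 2);
        List<String> tokens = Arrays.asList("mahout", "bayes", "bayes", "unseen");
        // Prints the index -> weight map for the sample document.
        System.out.println(vectorize(tokens, dictionary, docFreq, 4));
    }
}
```

If this assumption holds, a document classified at runtime would need to be mapped through the same dictionary and document-frequency counts that were produced when the training vectors were built, so that each term lands on the index the model was trained with.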

Previously, in mahout 0.6, I could easily use cnb from the bayes
package by calling the "classifyDocument" method.

Thanks,
Nora