Posted to user@mahout.apache.org by Nora Olsen <no...@gmail.com> on 2012/08/22 09:45:08 UTC
Using CNaiveBayes in mahout 0.7 without command line
Hi,
I can build and test the model from the command line:
./bin/mahout seqdirectory -i /tmp/articles -o /tmp/articles-seq
./bin/mahout seq2sparse -i /tmp/articles-seq -o /tmp/articles-vectors \
-lnorm -nv -wt tfidf
./bin/mahout split -i /tmp/articles-vectors/tfidf-vectors \
--trainingOutput /tmp/articles-train-vectors \
--testOutput /tmp/articles-test-vectors --randomSelectionPct 40 \
--overwrite --sequenceFiles -xm sequential
./bin/mahout trainnb -i /tmp/articles-train-vectors -el \
-o /tmp/model -li /tmp/labelindex -ow -c
./bin/mahout testnb -i /tmp/articles-train-vectors -m /tmp/model \
-l /tmp/labelindex -ow -o /tmp/articles-testing -c
./bin/mahout testnb -i /tmp/articles-test-vectors -m /tmp/model \
-l /tmp/labelindex -ow -o /tmp/articles-testing -c
However, I need to call "classifier.classifyFull()" directly from my
program instead of going through the command line.
Looking at the source code of SparseVectorsFromSequenceFiles, I am
unsure how to convert a text document into a vector that the
classifier can use.
Here is the code I have so far. The results are bad: more than 80% of
documents are wrongly classified, compared with running via the
command line.
LuceneTextValueEncoder enc = new LuceneTextValueEncoder("text");
Analyzer analyzer = ClassUtils.instantiateAs(analyzerClass, Analyzer.class);
enc.setAnalyzer(analyzer);
Vector features = new RandomAccessSparseVector(10000);
enc.addToVector(text, features);
Vector classifierResults = classifier.classifyFull(features);
Previously in mahout 0.6, I could use cnb in the bayes package easily
by calling the "classifyDocument" method.
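A possible explanation for the mismatch is that seq2sparse assigns each term an index from the dictionary it writes alongside the vectors and weights it by TF-IDF, whereas LuceneTextValueEncoder is a hashed encoder, so its indices need not line up with the ones the model was trained on. Below is a minimal sketch of the dictionary-based approach in plain Java. The class name, the in-memory maps (standing in for the dictionary and document-frequency files) and the textbook tf * log(N/df) weighting are all assumptions for illustration, not Mahout's API or its exact weighting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: builds a sparse TF-IDF vector
// from a term dictionary and document-frequency counts held in memory.
final class DictionaryTfIdfSketch {

    // Returns term index -> weight for one already-tokenized document.
    // Tokens missing from the dictionary are skipped, because the model
    // has no feature for them.
    static Map<Integer, Double> vectorize(String[] tokens,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Long> documentFrequency,
                                          long numDocs) {
        // Count how often each known term occurs in this document.
        Map<Integer, Integer> termCounts = new HashMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) {
                termCounts.merge(index, 1, Integer::sum);
            }
        }

        // Weight each count with a textbook idf = log(N / df).
        // Assumption: Mahout's own weighting may differ from this formula.
        Map<Integer, Double> vector = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : termCounts.entrySet()) {
            Long df = documentFrequency.get(entry.getKey());
            if (df == null || df == 0L) {
                continue; // no frequency information for this term
            }
            double idf = Math.log((double) numDocs / df);
            vector.put(entry.getKey(), entry.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Made-up dictionary and document frequencies, for illustration only.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("mahout", 0);
        dictionary.put("bayes", 1);

        Map<Integer, Long> documentFrequency = new HashMap<>();
        documentFrequency.put(0, 1L);
        documentFrequency.put(1, 2L);

        String[] tokens = {"mahout", "bayes", "mahout", "unknown"};
        System.out.println(vectorize(tokens, dictionary, documentFrequency, 4L));
    }
}
```

The resulting index/weight pairs would then be copied into a RandomAccessSparseVector sized to the dictionary before calling classifyFull(). The tokens would need to come from the same analyzer that seq2sparse used.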
Thanks,
Nora