
Posted to user@mahout.apache.org by Brian Feeny <bf...@mac.com> on 2013/04/16 02:09:09 UTC

How to use Naive Bayes Classifier to classify new data?

I am using Mahout version 0.7.

I have used the complementary Naive Bayes classifier to classify basic spam/ham messages, as follows:

Copy easy_ham and spam directories into 20news-all:
cp -R easy_ham/ spam/ 20news-all/

Copy 20news-all to HDFS:
hadoop fs -put 20news-all

Prepare data by sequencing into vectors:
mahout seqdirectory -i 20news-all -o 20news-seq
mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

Split the data into training and test sets, with 20% used for testing and 80% for training:
mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

Build the model:
mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

You can test the model against the training set:
mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c

Now test against the test set:
mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c


This all works fine, and I get good results in my confusion matrix output.

Now suppose I have a new message called message.txt. How would I pass it to my model to see whether it is classified as spam or ham?

Re: How to use Naive Bayes Classifier to classify new data?

Posted by Robin Anil <ro...@gmail.com>.

There are a few things you should know.

   1. seq2sparse builds a single dictionary from the train and test files
   combined. If you need to create vectors from the text of a new file,
   you have to reuse that dictionary. Otherwise, if you run seq2sparse on
   the new dataset alone, the term ids in the resulting vectors will
   differ from the ones the model was trained on. So I would recommend
   staying away from it.
   2. First, re-run this experiment using seq2encoded. This program uses
   a hash function (murmur2) to encode the text into vectors, so running
   it on a new dataset produces vectors consistent with the old ones.
   3. Once that's done, run seq2encoded on a directory of unseen text
   documents (including your message.txt, among others), and then run
   testnb on the result.
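
The difference between the two encodings in points 1 and 2 can be illustrated with a small sketch in plain shell. This is not Mahout code and only mimics the idea: dict_ids and hash_id are hypothetical helper names, cksum (a CRC) stands in for the murmur hash, and the vector size of 1000 is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# Illustration only: plain shell, not Mahout code.

# Dictionary approach: each distinct term gets the next free integer id,
# in order of first appearance, so the ids depend on the corpus scanned.
dict_ids() {
  tr -s ' ' '\n' | awk 'NF && !($0 in id) { id[$0] = n++; print $0 "=" id[$0] }'
}

# Hashing approach: the index is a function of the term alone.
# cksum (CRC) is a stand-in for murmur; 1000 is an assumed vector size.
hash_id() {
  printf '%s' "$1" | cksum | awk '{ print $1 % 1000 }'
}

train="free money now"
new="money back guarantee"

echo "dictionary ids, training corpus:"; echo "$train" | dict_ids
echo "dictionary ids, new corpus:";      echo "$new"   | dict_ids
echo "hashed index of 'money': $(hash_id money)"
```

With a per-corpus dictionary, "money" gets id 1 in the training corpus but id 0 in the new one, so a model trained on the first would misread vectors built from the second. The hashed index depends only on the term itself, so it is the same whichever corpus the term appears in.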

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
