Posted to user@mahout.apache.org by Tharindu Rusira <th...@gmail.com> on 2014/03/18 14:22:56 UTC

Naive Bayes classification

Hi everyone,
I'm developing an application where I need to train a Naive Bayes
classification model and use this model to classify new entities (in this
case, text files based on their content).

I observed that the seqdirectory command always adds the file/directory name
as the "key" field for each document, which is then used as the label in
classification jobs.
This makes sense when I need to train a model and create the labelindex,
since I have organized my training data according to their labels in
separate directories.

Now I'm trying to use this model to infer the best label for an unknown
document.
My requirement is to ask Mahout to read my new file and output the
predicted category by looking at the labelindex and the tfidf vector of the
new content.
I tried creating vectors from the new content (seqdirectory and
seq2sparse) and then using these vectors to run the testnb command.
Unfortunately, the seqdirectory command adds file names as labels, which does
not make sense in classification.

The following error message demonstrates this behavior.
input0.txt is the file name of my new document.

[main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
java.lang.IllegalArgumentException: Label not found: input0.txt
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
    at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
    at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
    at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
    at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)


So how can I achieve what I'm trying to do here?

Thanks,


-- 
M.P. Tharindu Rusira Kumara

Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka.
+94757033733
www.tharindu-rusira.blogspot.com

Re: Naive Bayes classification

Posted by Tharindu Rusira <th...@gmail.com>.
Hi, first of all I'm sorry that my previous mail was vague and poorly
formulated.
Yes, Suneel understood exactly what I was asking. Both options will address
my requirement.
Thanks a lot.
-Tharindu
On Mar 19, 2014 8:51 AM, "Suneel Marthi" <su...@yahoo.com> wrote:

> Tharindu,
>
> If I understand what you are trying to do:
>
> a) You have a trained Bayes model.
> b) You would like to classify new documents using this trained model.
> c) You were trying to use TestNaiveBayesDriver to classify the documents
> in (b).
>
> Option 1:
> -----------
>
> You could write a custom MapReduce job that creates sequence files from
> the documents (without the label key). You could then feed these sequence
> files to seq2sparse to generate your vectors and call TestNaiveBayesDriver
> with this input. Let me know if you need code for the first part.
>
>
> Option 2:
> -----------
> Work with your existing tf-idf vectors generated from seqdirectory ->
> seq2sparse, but instead of invoking Mahout's TestNaiveBayesDriver, create a
> custom MapReduce job (or a plain Java program if that's fine with you) that
> basically does the following:
>
> a) Instantiate a classifier with the trained model (pseudocode below):
>
>     NaiveBayesModel naiveBayesModel =
>         NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
>     AbstractVectorClassifier classifier =
>         new StandardNaiveBayesClassifier(naiveBayesModel);
>
>     // Parse through the input tf-idf vectors <Text, VectorWritable> and
>     // feed them to the classifier
>     for (Pair<Text,VectorWritable> record :
>          new SequenceFileDirIterable<Text,VectorWritable>(getInputPath(),
>              PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
>         // invoke the classifier on the incoming vector
>         Vector result = classifier.classifyFull(record.getSecond().get());
>         // write the document key together with the classifier's result vector
>         context.write(record.getFirst(), new VectorWritable(result));
>     }
>
> You can have the above code as part of a mapper in an MR job.

Re: Naive Bayes classification

Posted by Suneel Marthi <su...@yahoo.com>.
Tharindu,

If I understand what you are trying to do:

a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
-----------

You could write a custom MapReduce job that creates sequence files from the documents (without the label key). You could then feed these sequence files to seq2sparse to generate your vectors and call TestNaiveBayesDriver with this input. Let me know if you need code for the first part.


Option 2:
-----------
Work with your existing tf-idf vectors generated from seqdirectory -> seq2sparse, but instead of invoking Mahout's TestNaiveBayesDriver, create a custom MapReduce job (or a plain Java program if that's fine with you) that basically does the following:

a) Instantiate a classifier with the trained model (pseudocode below):

    NaiveBayesModel naiveBayesModel =
        NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
    AbstractVectorClassifier classifier =
        new StandardNaiveBayesClassifier(naiveBayesModel);

    // Parse through the input tf-idf vectors <Text, VectorWritable> and
    // feed them to the classifier
    for (Pair<Text,VectorWritable> record :
         new SequenceFileDirIterable<Text,VectorWritable>(getInputPath(),
             PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
        // invoke the classifier on the incoming vector
        Vector result = classifier.classifyFull(record.getSecond().get());
        // write the document key together with the classifier's result vector
        context.write(record.getFirst(), new VectorWritable(result));
    }

You can have the above code as part of a mapper in an MR job.
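
For what it's worth, here is a minimal sketch of turning one of those result vectors into a predicted label. It is plain Java with no Mahout dependency, so every name in it is hypothetical: BestLabel, bestLabel, the double[] standing in for the result vector, and the Map<Integer, String> standing in for the labelindex. It assumes the result holds one score per label, that a higher score means a better match, and that positions line up with the entries of the labelindex.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: picks the label with the highest score.
final class BestLabel {

    // scores: one entry per label (assumption: higher is better).
    // labelIndex: position in the score array -> label name (assumption: mirrors the labelindex).
    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("Empty score vector");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        String label = labelIndex.get(best);
        if (label == null) {
            throw new IllegalArgumentException("No label for index " + best);
        }
        return label;
    }

    public static void main(String[] args) {
        // Made-up label index and scores, for illustration only.
        Map<Integer, String> labelIndex = new HashMap<>();
        labelIndex.put(0, "politics");
        labelIndex.put(1, "sports");
        labelIndex.put(2, "tech");

        double[] scores = {-120.5, -98.2, -143.0};
        System.out.println(bestLabel(scores, labelIndex)); // prints "sports"
    }
}
```

In a real job the scores would come from the classifier output and the map from the labelindex file; the sample values above are made up.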

On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <ke...@gmail.com> wrote:
 
To use Naive Bayes you need a SequenceFile <Text, VectorWritable> with the
key formatted like this: "label/label", for some reason. I checked the
sources to be sure, and it parses the key looking for a '/'.

When you used seqdirectory, it told Naive Bayes to classify the content of
each file (e.g. file1.txt) with the label corresponding to its name (here,
file1.txt). So when you tried testing with input0.txt, it failed because
input0.txt was not a valid label.

I designed a MapReduce Java job that transforms a CSV with numeric values
into a proper SequenceFile. If you want, you can take the source and update
it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart




Re: Naive Bayes classification

Posted by Kevin Moulart <ke...@gmail.com>.
To use Naive Bayes you need a SequenceFile <Text, VectorWritable> with the
key formatted like this: "label/label", for some reason. I checked the
sources to be sure, and it parses the key looking for a '/'.

When you used seqdirectory, it told Naive Bayes to classify the content of
each file (e.g. file1.txt) with the label corresponding to its name (here,
file1.txt). So when you tried testing with input0.txt, it failed because
input0.txt was not a valid label.
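
A small sketch of why that happens, in plain Java with no Mahout dependency. It assumes that the label is taken to be the first component of the key after splitting on '/', and that keys for training data look like "/sports/doc1.txt"; both are illustrative assumptions, not confirmed details of the Mahout sources. The names LabelFromKey and labelOf and the label set are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Mahout: extracts a label from a sequence file key.
final class LabelFromKey {

    // Assumption: for a key such as "/sports/doc1.txt", the label is the
    // first component after the leading '/'.
    static String labelOf(String key) {
        String[] parts = key.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Key has no '/' separator: " + key);
        }
        return parts[1];
    }

    public static void main(String[] args) {
        // Made-up set of labels the model was trained on.
        Set<String> trainedLabels = new HashSet<>(Arrays.asList("sports", "politics"));

        // A key from a labelled directory layout, and a key from a flat
        // directory that only holds the new document.
        for (String key : new String[] {"/sports/doc1.txt", "/input0.txt"}) {
            String label = labelOf(key);
            String note = trainedLabels.contains(label) ? "" : " (not a trained label)";
            System.out.println(key + " -> " + label + note);
        }
        // prints:
        // /sports/doc1.txt -> sports
        // /input0.txt -> input0.txt (not a trained label)
    }
}
```

With a flat directory, the file name itself ends up in the position where a label is expected, which matches the "Label not found: input0.txt" message.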

I designed a MapReduce Java job that transforms a CSV with numeric values
into a proper SequenceFile. If you want, you can take the source and update
it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart


2014-03-18 20:13 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

> Hi Tharindu,
>
> If I understand correctly, seqdirectory creates labels based on the file
> name, but this is not what you want. What do you want the labels to be?
>
> Cheers,
>
> Frank

Re: Naive Bayes classification

Posted by Frank Scholten <fr...@frankscholten.nl>.
Hi Tharindu,

If I understand correctly, seqdirectory creates labels based on the file
name, but this is not what you want. What do you want the labels to be?

Cheers,

Frank

