Posted to user@mahout.apache.org by John Meagher <jo...@gmail.com> on 2013/08/07 23:00:37 UTC

Arff files to Naive Bayes

I'm just starting work with Mahout and I'm struggling to get an
example of a non-text-based Naive Bayes classifier up and running.
The input will be feature vectors generated outside of Mahout.  As a
test I'm using arff files (anything else CSV-ish will work).  I've
been able to convert things into vectors in a few different ways, but
can't figure out what is needed to get the trainnb command to work.

Does the label index need to be generated through some manual process
or something other than the arff.vector or trainnb command?

Is there a specific format needed for the input arff files?  Specific
columns in a specific order?


Here's what I've tried so far in both 0.7 from CDH4 and 0.8 direct from Apache:

$ wget http://repository.seasr.org/Datasets/UCI/arff/iris.arff
$ mahout arff.vector --input iris.arff --output iris.model --dictOut iris.labels

This works and seems to be right so far.

This is the command I think I need to train the Naive Bayes model.  It
fails when creating the label index with the exception below.

$ mahout trainnb -i iris.model/ -o iris.training -el -li iris.training.labels
...
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:123)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
...


Thanks for the help,
John
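
[Editorial sketch, not part of the original message.] One plausible reading of the trace above, stated as an assumption rather than a confirmed diagnosis: with -el, writeLabelIndex may derive each label from the sequence-file key by splitting it on "/" and taking the second element, so a key that contains no "/" would fail with index 1 out of bounds. The Python below only imitates that assumed behaviour; label_from_key and make_key are hypothetical helper names, and the "/label/id" key layout is likewise an assumption.

```python
def make_key(label: str, row_id: str) -> str:
    """Build a key in the assumed '/label/id' layout (hypothetical helper)."""
    return f"/{label}/{row_id}"


def label_from_key(key: str) -> str:
    """Imitate the assumed extraction: split on '/' and take element 1."""
    return key.split("/")[1]


# Keys in the assumed layout yield their label.
good_keys = [make_key("setosa", "0"), make_key("versicolor", "50")]
labels = sorted({label_from_key(k) for k in good_keys})

# A key with no '/' has only one element after the split, so index 1 fails.
try:
    label_from_key("0")
    failed = False
except IndexError:
    failed = True
```

If this reading is right, the vectors would need keys that carry the label before trainnb can build the label index; whether arff.vector can emit such keys is not established in this thread.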

Re: Arff files to Naive Bayes

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Aug 7, 2013 at 3:56 PM, John Meagher <jo...@gmail.com> wrote:

> Continuous values are being used now in addition to a large set of
> boolean flags.  I think I could convert the continuous values to some
> sort of bucketed values that could be represented as additional flags.
>  If that was the case would the format need to be ...
> id1 flaga flagb
> id2 flagb flagc
>

Yes.
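
[Editorial sketch, not part of the original reply.] The bucketing idea quoted above could look roughly like the following. This is a minimal sketch under assumptions: the bucket edges, the feature name "age", and the token naming scheme (name_bN) are all made up for illustration, and the output is only the "id token token ..." line layout discussed in the thread, not a format that Mahout is confirmed to read directly.

```python
from bisect import bisect_right


def bucket_token(name, value, edges):
    """Turn a continuous value into a flag token such as 'age_b2'.

    edges is a sorted list of bucket boundaries; the bucket number is
    the count of boundaries that are <= value.
    """
    return f"{name}_b{bisect_right(edges, value)}"


def to_token_line(row_id, flags, continuous, edges_by_feature):
    """Build one 'id flag flag ...' line from boolean flags plus bucketed values."""
    tokens = list(flags)
    for name, value in continuous.items():
        tokens.append(bucket_token(name, value, edges_by_feature[name]))
    return " ".join([row_id] + tokens)


# Hypothetical edges and rows, for illustration only.
edges = {"age": [18, 30, 50]}
line1 = to_token_line("id1", ["flaga", "flagb"], {"age": 35.0}, edges)
line2 = to_token_line("id2", ["flagb", "flagc"], {"age": 12.0}, edges)
```

Each continuous feature then contributes exactly one extra token per row, so the rows remain sparse sets of tokens.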

Re: Arff files to Naive Bayes

Posted by John Meagher <jo...@gmail.com>.
Continuous values are being used now in addition to a large set of
boolean flags.  I think I could convert the continuous values to some
sort of bucketed values that could be represented as additional flags.
 If that were the case, would the format need to be ...
id1 flaga flagb
id2 flagb flagc

Also, I'm working towards an example that starts from feature vectors
rather than text documents and that can be turned over to a data
science group.  Naive Bayes is what is being used now, with data
extracted via Hive and loaded into R.  As a start I'm trying to come
up with an example that replicates that data flow using data in Hive
and Mahout for processing.

On Wed, Aug 7, 2013 at 6:29 PM, Ted Dunning <te...@gmail.com> wrote:
> By non-text, do you mean continuous values?   Or sparse sets of tokens?
>
> The general idea for Naive Bayes is that it requires input consisting of
> sparse sets of tokens.

Re: Arff files to Naive Bayes

Posted by Ted Dunning <te...@gmail.com>.
By non-text, do you mean continuous values?   Or sparse sets of tokens?

The general idea for Naive Bayes is that it requires input consisting of
sparse sets of tokens.