Posted to user@mahout.apache.org by Jossef Harush <jo...@gmail.com> on 2014/05/03 19:40:30 UTC

Fwd: Mahout Naive Bayes CSV Classification

I have these 2 CSV files:

   1. train-set.csv
   2. test-set.csv

Both of them are in the same structure (with different content) and similar
to this example (http://i.stack.imgur.com/jsckr.png) :

Each column is a feature, and the last column, class, holds the name of the
class to predict.

*Can anyone please provide sample code for:*

   1. Initializing Naive Bayes with a CSV file (model creation, training,
   required pre-processing, etc...)
   2. For a given CSV row - predicting a class

Thanks!

BTW, I'm using Mahout 0.9 and Hadoop 2.4, and I've already tried to follow these
links:

http://web.archiveorange.com/archive/v/y0uRZw9Q4iHdjrm4Rfsu
http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/

-- 
Sincerely,

Jossef Harush.
jossef.com <http://www.jossef.com>

RE: Mahout Naive Bayes CSV Classification

Posted by Andrew Palumbo <ap...@outlook.com>.

That would explain it: a feature that is zero in every training instance is not counted by NaiveBayesModel.numFeatures(). NaiveBayesModel.numFeatures() returns the number of features (term counts, if this were a text classification problem) with a non-zero count across the entire input set.
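
To make the 40-versus-41 discrepancy concrete, here is a minimal sketch in plain Java. It does not use Mahout; the class and method names are hypothetical, and the tiny training matrix is made up. It only illustrates the idea described above: sum each feature column over the whole training set, then count the columns whose total is non-zero.

```java
import java.util.Arrays;

// Hypothetical stand-in for the training data; this is not Mahout's API.
class NonZeroFeatureCount {

    // Sums each column over all rows, mimicking a per-feature weight vector.
    static double[] columnSums(double[][] rows, int numColumns) {
        double[] sums = new double[numColumns];
        for (double[] row : rows) {
            for (int j = 0; j < numColumns; j++) {
                sums[j] += row[j];
            }
        }
        return sums;
    }

    // Counts the columns whose summed weight is non-zero.
    static int countNonZero(double[] sums) {
        int count = 0;
        for (double s : sums) {
            if (s != 0.0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up data: the middle feature is zero in every instance.
        double[][] train = {
            {1.0, 0.0, 3.0},
            {2.0, 0.0, 0.0},
            {0.0, 0.0, 5.0},
        };
        double[] sums = columnSums(train, 3);
        System.out.println(Arrays.toString(sums)); // per-column totals
        System.out.println(countNonZero(sums));    // 2 of the 3 columns are non-zero
    }
}
```

With three columns and one of them all-zero, the count comes out as 2, which mirrors a 41-column CSV being reported as 40 features.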



> From: jossef12@gmail.com
> Date: Tue, 6 May 2014 21:04:18 +0300
> Subject: Re: Mahout Naive Bayes CSV Classification
> To: user@mahout.apache.org
> 
> Yes
> 
> 
> On Mon, May 5, 2014 at 10:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:
> 
> > Jossef,
> > Does your training set have any features with a zero value for all
> > instances?

Re: Mahout Naive Bayes CSV Classification

Posted by Jossef Harush <jo...@gmail.com>.

Yes


On Mon, May 5, 2014 at 10:51 PM, Andrew Palumbo <ap...@outlook.com> wrote:

> Jossef,
> Does your training set have any features with a zero value for all
> instances?



-- 
Sincerely,

Jossef Harush.
jossef.com <http://www.jossef.com>

RE: Mahout Naive Bayes CSV Classification

Posted by Andrew Palumbo <ap...@outlook.com>.

Jossef,
Does your training set have any features with a zero value for all instances?

> Date: Mon, 5 May 2014 08:33:37 +0300
> Subject: RE: Mahout Naive Bayes CSV Classification
> From: jossef12@gmail.com
> To: user@mahout.apache.org
> 
> a link to a github gist with my java code and a small sample from the CSV
> i'm using can be found here:
> https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a

RE: Mahout Naive Bayes CSV Classification

Posted by Jossef Harush <jo...@gmail.com>.

A link to a GitHub gist with my Java code and a small sample of the CSV
I'm using can be found here:
https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a

RE: Mahout Naive Bayes CSV Classification

Posted by Andrew Palumbo <ap...@outlook.com>.

Hi Jossef,

I can answer your first two questions for you:
 
> 1) Are these predicted values normal?

Yes, negative scores are normal.  

> 2) For now, i'm assuming that the max value 'wins'. is that correct?

That is correct. NaiveBayes uses a winner-takes-all approach to class assignment, based on the max score across all classes, i.e.:

> {0:-2119.616101368751,1:-2536.217343666528}

will be classified as 0. 
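
As a small illustration of both answers, here is a sketch in plain Java (no Mahout; the class and method names are hypothetical). It assumes the scores are sums of count times log-probability, which is the usual Naive Bayes formulation but is an assumption here, not something confirmed in this thread. Under that assumption every term is the log of a number below 1, so the total is negative, and the predicted class is simply the index of the largest score.

```java
class WinnerTakesAll {

    // Returns the index of the largest score; ties go to the lowest index.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Assumed scoring: sum of count * log(probability). Each probability is
    // below 1, so each log term is negative and so is the total.
    static double logScore(double[] counts, double[] probs) {
        double score = 0.0;
        for (int i = 0; i < counts.length; i++) {
            score += counts[i] * Math.log(probs[i]);
        }
        return score;
    }

    public static void main(String[] args) {
        // Scores copied from the first example row in the thread.
        double[] scores = {-2119.616101368751, -2536.217343666528};
        System.out.println(argMax(scores)); // 0

        // Made-up counts and per-feature probabilities.
        double[] counts = {3.0, 1.0};
        double[] probs = {0.2, 0.05};
        System.out.println(logScore(counts, probs) < 0); // true
    }
}
```

The less negative score wins, so large magnitudes such as -25620 are not a problem in themselves; only the ordering between classes matters.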

> 3) When i call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java)
> it returns 40 instead of 41 features. Why is that?

This seems odd.  Is it possible that something is getting dropped in your vectorization process?  

Could you give a little more information on how you're using this? In particular, could you clarify what you're referring to with "(line 96 in MahoutTest.java)"?

Thanks,

Andy   


Re: Fwd: Mahout Naive Bayes CSV Classification

Posted by Jossef Harush <jo...@gmail.com>.

Hey Sebastian,

Thanks for your reply.

A link to a GitHub gist with my Java code and a small sample of the CSV
I'm using can be found here:
https://gist.github.com/Jossef/e6c8fc0c31f0c2bf036a



I wrote code that converts each CSV row (41 features + class name) to a
RandomAccessSparseVector and appends it to a sequence file.

I successfully managed to create a model from the sequence file and to
run the NaiveBayes classifier with data.


My problem is that I get negative results when I call
'classifier.classifyFull', e.g.:


{0:-2119.616101368751,1:-2536.217343666528}
{0:-3210.7575139461096,1:-4569.913127240827}
{0:-2986.049040829474,1:-3473.9551320126384}
{0:-2411.582039236549,1:-3487.8547154600456}
{0:-25620.824856365696,1:-31625.63011412386}
{0:-4601.922062356241,1:-5019.98413435188}
{0:-4331.835315861215,1:-4718.881475757016}
{0:-3568.9589306062785,1:-4132.310969149298}
...
...




1) Are these predicted values normal?
2) For now, I'm assuming that the max value 'wins'. Is that correct?
3) When I call 'naiveBayesModel.numFeatures()' (line 96 in MahoutTest.java),
it returns 40 instead of 41 features. Why is that?


Thanks :)







-- 
Sincerely,

Jossef Harush.
jossef.com <http://www.jossef.com>

Re: Fwd: Mahout Naive Bayes CSV Classification

Posted by Sebastian Schelter <ss...@apache.org>.

Hi Jossef,

You have to vectorize and normalize your data. The input for Naive Bayes
is a SequenceFile with a Text object as the key (your label) and a
VectorWritable as the value, holding a vector with your data.
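
The vectorization step can be sketched in plain Java as follows. This is only an illustration under stated assumptions: it uses a hypothetical LabeledRow class and a double[] in place of Hadoop's Text and Mahout's VectorWritable, it assumes plain numeric fields with no quoting or embedded commas, and it does not write a sequence file or normalize anything.

```java
import java.util.Arrays;

class CsvRowVectorizer {

    // Label plus numeric features, standing in for the (Text, VectorWritable) pair.
    static final class LabeledRow {
        final String label;
        final double[] features;

        LabeledRow(String label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Splits one CSV line: the last column is the class, the rest are features.
    static LabeledRow parse(String line) {
        String[] cells = line.split(",");
        double[] features = new double[cells.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(cells[i].trim());
        }
        return new LabeledRow(cells[cells.length - 1].trim(), features);
    }

    public static void main(String[] args) {
        // Made-up row with three features and a class name.
        LabeledRow row = parse("0, 2.5, 7, normal");
        System.out.println(row.label + " -> " + Arrays.toString(row.features));
    }
}
```

In a real job each parsed row would then be appended to the sequence file as a key/value pair, with the label as the key and the feature vector as the value.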

Instructions to run NaiveBayes can be found here:

https://mahout.apache.org/users/classification/bayesian.html

--sebastian

