Posted to user@mahout.apache.org by Jason Surratt <ja...@spadac.com> on 2010/02/17 22:58:22 UTC

Naive Bayes on Well Structured Data

Hello!

I'm new to Mahout, but I've been doing ML and Hadoop for a while now. I've got a fairly large dataset (about 1 billion records and ~200 columns) and a client interested in performing binary classification on the data. I've done some preliminary investigation with subsamples of the data in Weka and Naive Bayes performs surprisingly well. My data has some features with large numbers of nominal values in it that would benefit from very large sample sizes when training. In the end I need to do something like the following:


*         Read data from a tab delimited file

*         Discretize the numeric data (simply equal interval binning will probably be fine)

*         Build a NB or similarly performing classifier on a relatively large training data set

*         Evaluate against a test set of similarly structured data

*         Generate ROC curves and similar evaluation metrics against the test set
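The discretization step above can be sketched roughly as follows. This is plain Python for illustration, not Mahout code; the bin count of 10 and the per-column handling are assumptions:

```python
def equal_width_bins(values, n_bins=10):
    """Assign each numeric value in one column to one of n_bins
    equal-width intervals between the column's min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # degenerate column: everything in one bin
    width = (hi - lo) / n_bins
    # Clamp so the maximum value lands in the last bin rather than bin n_bins.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

In a MapReduce setting the min/max would come from a first counting pass over the data rather than from an in-memory list.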

I've gone through the Twenty Newsgroups examples and it appears that Mahout has some of the building blocks I need, but may be missing others. I'm comfortable writing all of these pieces from scratch, but I'd prefer to build this functionality into Mahout or a similar open source project and I have the support of my employer to do so.

Which leads me to my questions: Does Mahout already have all the functionality that I'm looking for and I just missed it? Would this be beneficial and in line with Mahout? If this does make sense, where would you suggest I start?

Thanks in advance!

Jason R. Surratt
SPADAC
email: jason.surratt@spadac.com


Re: Naive Bayes on Well Structured Data

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Feb 17, 2010 at 7:10 PM, Jason Surratt <ja...@spadac.com> wrote:

> I've spent a bit of time looking over Drew's Avro stuff as well as
> http://issues.apache.org/jira/browse/MAHOUT-262, SVM and the SGD
> implementations. Is it the intent for classifiers to use
> SingleLabelVectorWritable as the input value during the map step at some
> point in the future? If so, I'm happy to write up some code around Naive
> Bayes and an input format to do just that -- maybe it'll be useful to
> someone else.
>

We definitely want to have a common input format for all algorithms (where
it makes sense).  The two candidates are honest-to-goodness sparse or dense
vectors versus something like a document.  Since it saves a huge amount of
effort to integrate the conversion from document to vector directly into
the algorithm, it is looking like all algorithms will need to support both.
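The document-to-vector conversion being discussed can be illustrated with a toy one-hot encoder: each (column, nominal value) pair gets its own dimension in a sparse vector. This is a Python sketch of the idea only, not Mahout's `Vector` or `SingleLabelVectorWritable` API; the dict-based sparse representation is an assumption:

```python
def vectorize(record, feature_index):
    """Map a record (dict of column -> nominal value) to a sparse vector
    {dimension: 1.0}, assigning a new dimension the first time a
    (column, value) pair is seen."""
    vec = {}
    for col, val in record.items():
        key = (col, val)  # one dimension per (column, value) pair
        if key not in feature_index:
            feature_index[key] = len(feature_index)
        vec[feature_index[key]] = 1.0
    return vec
```

With ~200 columns and high-cardinality nominals, the feature index can grow large, which is one reason a shared, reusable vectorization layer is attractive.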

Doing that without lots of effort in each algorithm is the trick that Robin
and Drew are working on just now.  Your contributions would be invaluable
(you are a real live user!)


>  There is a lot of code and JIRAs to take in so I apologize if I'm missing
> something.
>

No problem.  It is an exciting project that way.

-- 
Ted Dunning, CTO
DeepDyve

RE: Naive Bayes on Well Structured Data

Posted by Jason Surratt <ja...@spadac.com>.
Ted,

Thanks for the speedy response!

> This sounds great.  I would suggest you test the naive Bayes,
> complementary
> Naive Bayes, SVM and SGD implementations.  Given that naive Bayes has
> worked
> well on a sample, you will probably be very happy with SVM and SGD
> since
> they handle very large cardinality well.

Thanks! I'll be sure to try the other classifiers after I get NB working.

> You will need to vectorize your input.  Since you have many columns,
> you may
> want to look at Drew's document style stuff.  See
> https://issues.apache.org/jira/browse/MAHOUT-274

I've spent a bit of time looking over Drew's Avro stuff as well as http://issues.apache.org/jira/browse/MAHOUT-262, SVM and the SGD implementations. Is it the intent for classifiers to use SingleLabelVectorWritable as the input value during the map step at some point in the future? If so, I'm happy to write up some code around Naive Bayes and an input format to do just that -- maybe it'll be useful to someone else.

There is a lot of code and JIRAs to take in so I apologize if I'm missing something.

Cheers!

-jason


Re: Naive Bayes on Well Structured Data

Posted by Ted Dunning <te...@gmail.com>.
This sounds great.  I would suggest you test the naive Bayes, complementary
Naive Bayes, SVM and SGD implementations.  Given that naive Bayes has worked
well on a sample, you will probably be very happy with SVM and SGD since
they handle very large cardinality well.
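For reference, the naive Bayes being recommended here amounts to counting (label, column, value) co-occurrences and scoring with smoothed log probabilities. The sketch below is a minimal Python illustration, not Mahout's implementation; the Laplace smoothing denominator is deliberately simplified (a full implementation would track the per-column value cardinality):

```python
from collections import defaultdict
from math import log

def train_nb(rows, labels, alpha=1.0):
    """Count (label, column, value) occurrences; alpha is Laplace smoothing."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[label][(col, val)]
    label_counts = defaultdict(int)
    for row, y in zip(rows, labels):
        label_counts[y] += 1
        for col, val in row.items():
            counts[y][(col, val)] += 1
    return counts, label_counts, alpha

def predict(model, row):
    counts, label_counts, alpha = model
    n = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for y, ny in label_counts.items():
        score = log(ny / n)  # log prior
        for col, val in row.items():
            # Smoothed conditional P(value | label).  The denominator
            # "ny + 2 * alpha" assumes a binary-ish value space; a real
            # implementation uses ny + alpha * (distinct values in col).
            score += log((counts[y][(col, val)] + alpha) / (ny + 2 * alpha))
        if score > best_score:
            best, best_score = y, score
    return best
```

Complementary NB differs mainly in estimating each class's statistics from the *other* classes' counts, which helps with skewed class sizes.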

You will need to vectorize your input.  Since you have many columns, you may
want to look at Drew's document style stuff.  See
https://issues.apache.org/jira/browse/MAHOUT-274

There are the beginnings of some vectorization of the sort you will need in
the SGD patch: http://issues.apache.org/jira/browse/MAHOUT-228  That also
has a learning system that will build your classifier using an on-line
logistic regression.
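The on-line logistic regression mentioned here boils down to one small SGD update per example. A minimal Python sketch of a single step, assuming sparse `{dimension: value}` examples and a fixed learning rate (Mahout's actual learner has adaptive rates and regularization):

```python
from math import exp

def sgd_step(weights, vec, label, lr=0.1):
    """One online logistic-regression update on a sparse example.
    vec: {dimension: value}, label: 0 or 1; weights is updated in place."""
    margin = sum(weights.get(i, 0.0) * x for i, x in vec.items())
    p = 1.0 / (1.0 + exp(-margin))  # predicted P(label = 1)
    g = label - p                   # gradient of the log-likelihood
    for i, x in vec.items():
        weights[i] = weights.get(i, 0.0) + lr * g * x
    return p
```

Because each update touches only the example's non-zero dimensions, this handles very high feature cardinality gracefully, which is why SGD is a good fit for the dataset described.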

The SVM implementation is at http://issues.apache.org/jira/browse/MAHOUT-232

The NB and CNB implementations are in Mahout itself already.

On Wed, Feb 17, 2010 at 1:58 PM, Jason Surratt <ja...@spadac.com> wrote:

> Which leads me to my questions: Does Mahout already have all the
> functionality that I'm looking for and I just missed it? Would this be
> beneficial and in line with Mahout? If this does make sense, where would you
> suggest I start?
>



-- 
Ted Dunning, CTO
DeepDyve