Posted to user@mahout.apache.org by Loic Descotte <lo...@kelkoo.com> on 2011/10/07 10:40:38 UTC

Some questions about text classification with Naive Bayes/SGD

Hi,
My problem is to classify text files that have multiple attributes.
Some attributes are text, others are numeric (e.g. in CSV format).
I've started to work with Naive Bayes and SGD.

And now I have a few questions for each :) (sorry for the long mail, I'm
trying to group all my questions to avoid posting dozens of messages)
_Naive Bayes_

- The *TrainClassifier* and *TestClassifier* classes take a data
directory as input. If I work with CSV files (multiple attributes), how
do I configure the algorithm for each attribute (e.g. the 3rd attribute
is numeric, etc.)?

- I'm thinking about the way to make some data entries match several
categories. Is this kind of thing possible with classification
algorithms (e.g. Naive Bayes)?
For example, if I want to tag news, some items could be both
"international" and "politics" news.

- I've tried to classify 110 000 entries (after training on 440 000
entries) and Mahout fails with a Java heap space error, even with more
than 2 GB of memory for the JVM.
Do I have a configuration issue or is this expected?

- I have good results on small data sets with Naive Bayes, better than
the SVM and SGD tests I've done (I've tried many algorithms with Weka on
my CSV files).
The theory says that Naive Bayes only fits big data sets, so is it
dangerous to choose it anyway for small data set analysis?
For example, with 80 entries for training and 4 categories, I get 90%
accuracy on my text files (with 40 entries in the test data). SVM gives
a very bad score for this (~40%), and logistic regression ~60%.
So I'm very confused...
Maybe my entries are very simple for Naive Bayes, so it does not need a
lot of data for training?


_SGD_

I worked from the *RunLogistic* class found in the Mahout examples.
As my examples have more than 2 categories, I used the classifyFull()
method from *OnlineLogisticRegression* instead of classifyScalar().
For the evaluation of the model, I had to modify the *Auc* class,
because it could only manage a matrix of 2 elements.

It works fine with small data sets, but now I have some strange results 
with bigger sets.
Maybe I've done it wrong...
So my question is: is there a way in Mahout to classify and test (and
get some metrics like Auc and Confusion) with more than 2 categories
without modifying the provided classes?
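
To make the question concrete, here is a minimal sketch in plain Java of
the kind of multi-category evaluation I mean. It does not use any Mahout
classes: the class name MultiClassEval, the double[] score arrays
(standing in for the per-category scores that classifyFull() returns)
and the sample data are all assumptions made for illustration.

```java
import java.util.Arrays;

public class MultiClassEval {

    // Index of the highest score, taken as the predicted category.
    public static int argmax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Confusion matrix: rows are actual categories, columns are predicted ones.
    public static int[][] confusion(double[][] scores, int[] actual, int numCategories) {
        int[][] matrix = new int[numCategories][numCategories];
        for (int i = 0; i < scores.length; i++) {
            matrix[actual[i]][argmax(scores[i])]++;
        }
        return matrix;
    }

    // Fraction of entries on the diagonal, i.e. correctly classified.
    public static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int r = 0; r < matrix.length; r++) {
            for (int c = 0; c < matrix[r].length; c++) {
                total += matrix[r][c];
                if (r == c) {
                    correct += matrix[r][c];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical scores for 4 test entries and 3 categories.
        double[][] scores = {
            {0.7, 0.2, 0.1},
            {0.1, 0.6, 0.3},
            {0.2, 0.5, 0.3},
            {0.1, 0.2, 0.7}
        };
        int[] actual = {0, 1, 2, 2};

        int[][] matrix = confusion(scores, actual, 3);
        for (int[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
        System.out.println("accuracy = " + accuracy(matrix));
    }
}
```

This only covers a confusion matrix and accuracy; it says nothing about
how a multi-category Auc should be computed, which is the part I am
least sure about.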

Thanks a lot!

Loic