Posted to user@mahout.apache.org by Loic Descotte <lo...@kelkoo.com> on 2011/10/07 10:40:38 UTC
Some questions about text classification with Naive Bayes/SGD
Hi,
My problem is to classify multi-attribute text files.
Some attributes are in text format, others are numeric (e.g. in CSV format).
I've started to work with Naive Bayes and SGD,
and now I have a few questions about each :) (sorry for the long mail, I'm
trying to group all my questions to avoid posting dozens of messages)
_Naive Bayes_
- The *TrainClassifier* and *TestClassifier* classes take a data directory
as input. If I work with CSV files (multiple attributes), how do I set up the
algorithm for each attribute (e.g. the 3rd attribute is numeric, etc.)?
- I'm thinking about a way to make some data entries match several
categories. Is this kind of thing possible with classification
algorithms (e.g. Naive Bayes)?
For example, if I want to tag news, some items could be both
"international" and "politics" news.
- I've tried to classify 110,000 entries (after training on 440,000
entries) and Mahout fails with a Java heap space error, even with more than
2 GB of memory for the JVM.
Do I have a configuration issue, or is this normal?
- I have good results with small data sets with Naive Bayes, better than
the SVM and SGD tests I've done (I've tried many algorithms with Weka on my
CSV files).
The theory says that Naive Bayes only fits big data sets, so is it
dangerous to choose it anyway for small data set analysis?
For example, with 80 entries for training and 4 categories, I get 90%
accuracy on my text files (with 40 entries in the test data). SVM gives
a very bad score on this (~40%), and logistic regression ~60%.
So I'm very confused...
Maybe my entries are very simple for Naive Bayes, so it does not need a
lot of data for training?
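
To make the multi-category question above concrete (a news item tagged both "international" and "politics"), here is a minimal sketch of one possible approach: keep every category whose score reaches a cutoff instead of keeping only the best one. It is plain Java and does not use any Mahout class. The class name MultiLabelSketch, the helper labelsAbove, the scores and the 0.30 cutoff are all hypothetical, and the sketch assumes the classifier can return one normalized score per category, which may not hold for every algorithm.

```java
import java.util.ArrayList;
import java.util.List;

class MultiLabelSketch {

    // Hypothetical helper: returns every label whose score is at least the
    // threshold, so that one entry can receive several categories.
    static List<String> labelsAbove(String[] labels, double[] scores, double threshold) {
        if (labels.length != scores.length) {
            throw new IllegalArgumentException("labels and scores must have the same length");
        }
        List<String> picked = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            if (scores[i] >= threshold) {
                picked.add(labels[i]);
            }
        }
        return picked;
    }

    public static void main(String[] args) {
        // Made-up scores for one news item, for illustration only.
        String[] labels = {"international", "politics", "sports"};
        double[] scores = {0.48, 0.45, 0.07};
        // The 0.30 cutoff is an arbitrary assumption; it would need tuning on held-out data.
        System.out.println(labelsAbove(labels, scores, 0.30)); // prints [international, politics]
    }
}
```

An alternative with the same effect would be to train one binary classifier per category (one-vs-rest) and keep every category whose classifier answers "yes".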
_SGD_
I worked on the basis of the *RunLogistic* class found in the Mahout examples.
As my examples have more than 2 categories, I used the classifyFull()
method from *OnlineLogisticRegression* instead of classifyScalar().
For the evaluation of the model, I had to modify the *Auc* class,
because it was only able to handle a matrix of 2 elements.
It works fine with small data sets, but now I have some strange results
with bigger sets.
Maybe I've done it wrong...
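
For comparison, one common way to extend AUC beyond 2 categories is a one-vs-rest macro average: compute a binary AUC per category (that category against all the others) and average the results. The sketch below is plain Java, independent of Mahout, and is not necessarily what the modified *Auc* class does. It assumes each row of scores holds one score per category for one test entry (the kind of output a call such as classifyFull() gives), and the names OneVsRestAuc, aucForClass and macroAuc are hypothetical.

```java
class OneVsRestAuc {

    // Binary AUC for category k: the fraction of (entry of category k, entry of
    // another category) pairs in which the first gets the higher score for k.
    // Ties count as half a win. Returns NaN if no such pair exists.
    static double aucForClass(double[][] scores, int[] actual, int k) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != k) {
                continue; // i must be a positive example for category k
            }
            for (int j = 0; j < actual.length; j++) {
                if (actual[j] == k) {
                    continue; // j must be a negative example for category k
                }
                pairs++;
                if (scores[i][k] > scores[j][k]) {
                    wins += 1.0;
                } else if (scores[i][k] == scores[j][k]) {
                    wins += 0.5;
                }
            }
        }
        return pairs == 0 ? Double.NaN : wins / pairs;
    }

    // Unweighted mean of the per-category AUCs, skipping undefined ones.
    static double macroAuc(double[][] scores, int[] actual, int numClasses) {
        double sum = 0.0;
        int counted = 0;
        for (int k = 0; k < numClasses; k++) {
            double auc = aucForClass(scores, actual, k);
            if (!Double.isNaN(auc)) {
                sum += auc;
                counted++;
            }
        }
        return counted == 0 ? Double.NaN : sum / counted;
    }

    public static void main(String[] args) {
        // Made-up scores for four test entries and three categories.
        double[][] scores = {
            {0.7, 0.2, 0.1},
            {0.2, 0.5, 0.3},
            {0.3, 0.3, 0.4},
            {0.6, 0.3, 0.1}
        };
        int[] actual = {0, 1, 2, 1};
        System.out.println("macro AUC = " + macroAuc(scores, actual, 3));
    }
}
```

The pairwise loop is quadratic in the number of test entries, which is acceptable for a sanity check on small sets but would need a rank-based implementation for larger ones.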
So my question is: is there a way in Mahout to classify and test (and
get some metrics like AUC and a confusion matrix) with more than 2
categories, without modifying the provided classes?
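
To illustrate the kind of multi-category evaluation meant here, the following is a minimal sketch of a confusion matrix built from per-category scores. It is plain Java, not Mahout code; the names ConfusionSketch, argMax, confusion and accuracy are hypothetical, and it assumes the predicted category is simply the one with the highest score.

```java
class ConfusionSketch {

    // Index of the highest score; the first index wins on ties.
    static int argMax(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // m[a][p] counts the entries whose actual category is a and whose
    // predicted category is p.
    static int[][] confusion(double[][] scores, int[] actual, int numClasses) {
        int[][] m = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            m[actual[i]][argMax(scores[i])]++;
        }
        return m;
    }

    // Fraction of entries on the diagonal, i.e. correctly classified.
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int a = 0; a < m.length; a++) {
            for (int p = 0; p < m[a].length; p++) {
                total += m[a][p];
                if (a == p) {
                    correct += m[a][p];
                }
            }
        }
        return total == 0 ? Double.NaN : (double) correct / total;
    }

    public static void main(String[] args) {
        // Made-up scores for three test entries and three categories.
        double[][] scores = {
            {0.7, 0.2, 0.1},
            {0.1, 0.6, 0.3},
            {0.5, 0.3, 0.2}
        };
        int[] actual = {0, 1, 2};
        int[][] m = confusion(scores, actual, 3);
        System.out.println(java.util.Arrays.deepToString(m));
        System.out.println("accuracy = " + accuracy(m));
    }
}
```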
Thanks a lot!
Loic