Posted to user@mahout.apache.org by Tharindu Rusira <th...@gmail.com> on 2013/12/11 07:56:30 UTC

Mahout for text classification

Hello,
I'm currently comparing the Mahout classification algorithms that can be used
for text classification. I checked [1], but many of them have open issues, so
I'm not sure which of them are in working condition and properly supported
by Mahout.
According to what I've found so far, SVM is preferred for text
classification because of its ability to work with high-dimensional feature
spaces.
I've also gone through MAHOUT-334 and this recent mail thread [2].

According to the wiki, Naive Bayes seems to be a reliable candidate for a
classification task. Could someone please provide more details on this and
the suitability of Naive Bayes for text classification?

Thanks,

[1] https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms


[2] http://mail-archives.apache.org/mod_mbox/mahout-user/201312.mbox/%3cCAO+E6vdkE1PjLi4wTDCg8x-H4GYB3Bpe8bGriwx2j6j7istKjQ@mail.gmail.com%3e


-- 
M.P. Tharindu Rusira Kumara

Department of Computer Science and Engineering,
University of Moratuwa,
Sri Lanka.
+94757033733
www.tharindu-rusira.blogspot.com

Re: Mahout for text classification

Posted by tuku <ut...@gmail.com>.

I am currently using Naive Bayes (NB) for text classification.
I prefer NB over SVM because:
- SVM has a long training time
- NB can be trained incrementally
- NB training can be fully parallelized
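
To illustrate the last two points: a multinomial NB model is essentially a
set of counts, so a new document only adds to them, and models trained on
separate shards of the data can be combined by adding their counts. The
sketch below is plain Python, not Mahout's API; the class name CountingNB,
its methods, and the add-one (Laplace) smoothing are assumptions made for
illustration only.

```python
import math
from collections import Counter, defaultdict


class CountingNB:
    """Multinomial Naive Bayes stored as raw counts (hypothetical helper)."""

    def __init__(self):
        self.doc_counts = Counter()               # label -> documents seen
        self.term_counts = defaultdict(Counter)   # label -> term -> count

    def update(self, tokens, label):
        # Incremental training: one more document just adds to the counts.
        self.doc_counts[label] += 1
        self.term_counts[label].update(tokens)

    def merge(self, other):
        # Parallel training: shard models combine by summing their counts.
        self.doc_counts.update(other.doc_counts)
        for label, counts in other.term_counts.items():
            self.term_counts[label].update(counts)
        return self

    def predict(self, tokens):
        vocab = {t for counts in self.term_counts.values() for t in counts}
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            # Add-one smoothing so unseen terms do not zero out a class.
            denom = sum(counts.values()) + len(vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Two shards trained independently, then merged and updated further.
shard_a = CountingNB()
shard_a.update("cheap pills buy now".split(), "spam")
shard_b = CountingNB()
shard_b.update("meeting agenda for monday".split(), "ham")

model = shard_a.merge(shard_b)
model.update("buy cheap watches".split(), "spam")
prediction = model.predict("cheap watches now".split())
```

An SVM has no comparably simple merge step, which is one way to read the
training-time argument above.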

The main decisions you have to make when using NB are whether to use tf or
tf-idf weighting, and whether to use binary or multinomial NB.
If you classify short texts such as tweets, you should use tf; otherwise use
tf-idf.
There is no binary NB implementation in Mahout, but I implemented one in
a very short time (it is easy), and binary is usually better than
multinomial for text classification.
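
For readers unfamiliar with the terms, here is a minimal sketch of the three
feature weightings being compared. It is plain Python rather than Mahout
code; the toy corpus, the function names, and the plain log(N/df) form of
idf are assumptions for illustration, and Mahout's own vectorizer may use a
different idf variant. "Binary" here means recording only whether a term
occurs, which is one common reading of binary NB.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for this example).
docs = [
    "good good good movie".split(),
    "bad movie".split(),
    "good plot bad acting".split(),
]


def tf(tokens):
    # Multinomial-style features: raw term counts.
    return dict(Counter(tokens))


def binary(tokens):
    # Binary features: 1 if the term occurs at all, repeats ignored.
    return {t: 1 for t in set(tokens)}


def idf(corpus):
    # Inverse document frequency: terms found in few documents weigh more.
    n = len(corpus)
    df = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(n / df[t]) for t in df}


def tfidf(tokens, idf_weights):
    # Term count scaled by how rare the term is across the corpus.
    return {t: c * idf_weights[t] for t, c in Counter(tokens).items()}


weights = idf(docs)
tf_vec = tf(docs[0])
binary_vec = binary(docs[0])
tfidf_vec = tfidf(docs[0], weights)
```

In very short texts a term rarely occurs more than once, so the tf and
binary vectors are often nearly identical.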

