Posted to user@mahout.apache.org by Salman Mahmood <sa...@influestor.com> on 2012/08/30 11:30:54 UTC

A few questions about Classification

I have a few questions about classification in Mahout:

1) It is said that SGD is suited to small data sets and Naive Bayes to medium-sized ones. I have also read that the Bayes classifier is meant for textual data rather than continuous data.
I am classifying around 10,000 news articles, and they are all textual data (no continuous variables are used for classification). In my opinion the data set is small, so should I use SGD or Naive Bayes, given that the data is textual?

2) Since multi-label classification is not supported in Mahout, I generated around 4,000 binary models using SGD. This way I know whether a particular news item belongs to one or more classes ("Apple sues Samsung" belongs to both class "Apple" and class "Samsung").
The problem I am facing is performance. It takes around 4 minutes to classify one news item. Although the cost scales linearly (it doesn't take 8 minutes to classify 2 news items, 12 minutes for 3, and so on), I still want to improve the throughput. What I am doing is loading a particular model and classifying N news items, then loading the next model and classifying the same N items again. With this approach it takes 16 minutes to classify 1,000 news items of 75-100 words each. Is there a way to improve this further? (One option I am considering is using Hadoop for the classification task. Is that possible?)

3) Where can I find good code examples/tutorials for training and testing a Mahout Naive Bayes classifier? There are lots of examples on the net, but they all use the command line. I need to see the code for Naive Bayes because my dataset is not in files but in a database, and the command-line tools read the dataset from files. The Mahout in Action book gives a good walkthrough of the SGD code, but not of Naive Bayes.

Thanks! 

Re: A few questions about Classification

Posted by Lance Norskog <go...@gmail.com>.
3) database import
The most generic way is to use a Hadoop file reader that queries a
database. I don't know how to help you there.

In classify20newsgroups.sh, the first stage is:
  echo "Creating sequence files from 20newsgroups data"
  ./bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq

You need to replace this with something that reads your database and
creates Hadoop sequence files. The format is (Text, Text), where the
key is a unique name for the document and the value is the text of the
document. The next step in the script turns the document text into
term vectors. You do not have to change anything after the above
snippet.

  echo "Converting sequence files to vectors"
  ./bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
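A minimal sketch of the database-to-SequenceFile step might look like the following. This is only an illustration, not Mahout code: the table name `articles`, its `id` and `body` columns, and the JDBC URL are hypothetical placeholders for your own schema. It writes (Text, Text) pairs in the format described above, so the output can feed straight into seq2sparse.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DbToSeqFile {
  public static void main(String[] args) throws Exception {
    // args[0]: output path, e.g. ${WORK_DIR}/news-seq/chunk-0
    // args[1]: JDBC URL for your database (placeholder)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[0]), Text.class, Text.class);
    Connection db = DriverManager.getConnection(args[1]);
    try {
      Statement st = db.createStatement();
      // Hypothetical table/columns; substitute your own query.
      ResultSet rs = st.executeQuery("SELECT id, body FROM articles");
      Text key = new Text();
      Text value = new Text();
      while (rs.next()) {
        key.set("/news/" + rs.getString("id")); // unique document name
        value.set(rs.getString("body"));        // full document text
        writer.append(key, value);
      }
    } finally {
      writer.close();
      db.close();
    }
  }
}
```

After running this in place of the seqdirectory stage, the rest of the script (seq2sparse onward) should work unchanged.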






-- 
Lance Norskog
goksron@gmail.com