Posted to user@mahout.apache.org by Eugen Cepoi <ce...@gmail.com> on 2013/02/05 15:06:06 UTC

document classification command line wrong call order?

Hi there,

All the following is based on mahout 0.7.

I have seen a couple of examples doing the following for document
classification (by command line, using files/folders).

1) seqdirectory to prepare the correct file format
2) seq2sparse to build the feature vectors
3) split to divide the data into train/test sets
etc

I find it strange that the split is applied after seq2sparse. Doesn't that
mean all the available data is used while building the feature vectors (for
example, computing tf-idf over all docs)? If so, then everything is biased
(in the examples I have seen, not in mahout ;)), since the classifier
already has information about the data it will be tested on.
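
To make the concern concrete, here is a minimal plain-Python sketch (not
Mahout code). The toy documents, the helper names fit_idf and tfidf, and the
smoothed IDF formula are all made up for illustration. It compares fitting
the IDF weights on all docs with fitting them on the training docs only.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Learn IDF weights from the given docs only."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # count each term once per document
    # Smoothed IDF: an illustrative choice, not necessarily Mahout's formula.
    return {term: math.log((1 + n) / (1 + count)) + 1
            for term, count in df.items()}


def tfidf(doc, idf):
    """Vectorize one doc with an already fitted IDF; unknown terms are dropped."""
    tf = Counter(doc.split())
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


# Hypothetical toy corpus: the first three docs are train, the last is test.
docs = [
    "cheap pills buy now",
    "meeting agenda for monday",
    "buy cheap watches now",
    "monday meeting moved to friday",
]
train, test = docs[:3], docs[3:]

idf_all = fit_idf(docs)      # vectorize first, split later: sees the test doc
idf_train = fit_idf(train)   # split first: fitted on the training docs only

leaky = tfidf(test[0], idf_all)
clean = tfidf(test[0], idf_train)

print(sorted(leaky))  # includes terms that occur only in the test doc
print(sorted(clean))  # only terms already known from training
print(idf_all["monday"], idf_train["monday"])  # the weights differ as well
```

With the split done first, the test doc only keeps the terms seen during
training, and the weights of the shared terms change too. How much this
matters on a real corpus depends on the data.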

I noticed this while doing some testing for text classification. When I
did the split manually, the results were a lot worse (~50% vs 97%). Sure,
50% is not good, but that is probably due to my dataset, as it's only about
1K docs.


Another question: could you please point me to some resources on using
other mahout algorithms for text classification (command line or
programmatic)?

Thanks,
Eugen