Posted to user@mahout.apache.org by Tom Sercu <to...@gmail.com> on 2014/04/26 03:41:19 UTC

seq2sparse, keep train/test dataset separated - Reuse existing dictionary and IDF values

Dear all,
I'm trying to achieve something simple as a toy example for Mahout.
I want to prepare a collection of text documents for sentiment
classification. The labels are P(ositive) and N(egative). I have a train
and test set. Thus in total 4 datasets: train_P, train_N, test_P, test_N.
After converting them to SequenceFiles with key "/dataset/docid" and the
document text as value, I want to vectorize the training set and test set
over the same dictionary/IDF, but with the training and test sets kept
separate in the output.

The best way to do this would be to re-use the dictionary and IDF values.
Is this possible? Judging from the unresolved discussion on this mailing
list on Apr 17th 2011 and the unanswered Stack Overflow question at
http://stackoverflow.com/questions/20885406/can-the-mahout-seq2sparse-command-use-the-previous-generated-dictionary
I guess this is not implemented. In this case the dictionary/IDF would
only pick up words from the training data, which is perfectly fine.
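To make concrete what re-using the dictionary and IDF would mean, here is a
minimal sketch (plain Python, not Mahout's API; the dictionary, IDF values,
and tokens are made-up examples): the vocabulary and IDF weights are fixed
from the training data, and test documents are projected into that same
space, with out-of-vocabulary words simply dropped.

```python
import math

# Hypothetical dictionary and IDF values learned on the training set only.
dictionary = {"good": 0, "bad": 1, "movie": 2}
idf = {"good": math.log(2.0), "bad": math.log(2.0), "movie": math.log(1.0)}

def vectorize(tokens, dictionary, idf):
    """TF-IDF vector over the fixed training vocabulary.
    Words absent from the dictionary are ignored, so the test set
    introduces no new dimensions."""
    vec = [0.0] * len(dictionary)
    for token in tokens:
        if token in dictionary:
            vec[dictionary[token]] += 1.0  # raw term frequency
    for word, idx in dictionary.items():
        vec[idx] *= idf[word]  # weight by the training-set IDF
    return vec

# A test document containing an out-of-vocabulary word ("film") still maps
# into the same 3-dimensional space as the training vectors.
test_vector = vectorize(["good", "good", "film"], dictionary, idf)
print(test_vector)
```

The point is only that vectorizing the test set requires nothing from the
test corpus itself beyond the documents: the dictionary and IDF are frozen
training-time artifacts.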

The second option I see is to give all 4 input files to the same
seq2sparse command, obtaining one tfidf-vectorized dataset, and then to
split it afterwards with a Java program that reads through the whole
concatenated dataset and separates it into a train and a test dataset,
replacing the model targets: (train_P, test_P) become P and (train_N,
test_N) become N.
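The split/relabel step is straightforward given the "/dataset/docid" key
convention. A rough sketch of the logic (in Python over an in-memory dict
for illustration; the real program would read the Mahout output with
Hadoop's SequenceFile reader, and the keys and vectors here are invented):

```python
# Toy stand-in for the merged tfidf-vectorized output of seq2sparse,
# keyed "/dataset/docid" as in the original sequence files.
merged = {
    "/train_P/doc1": [0.2, 0.0],
    "/train_N/doc2": [0.0, 0.7],
    "/test_P/doc3":  [0.1, 0.1],
    "/test_N/doc4":  [0.3, 0.0],
}

train, test = {}, {}
for key, vector in merged.items():
    _, dataset, docid = key.split("/")   # "", "train_P", "doc1"
    split, label = dataset.split("_")    # "train", "P"
    target = train if split == "train" else test
    # Relabel: train_P and test_P both become P, likewise for N.
    target["/%s/%s" % (label, docid)] = vector

print(sorted(train))
print(sorted(test))
```

This keeps the shared dictionary/IDF (since everything went through one
seq2sparse run) at the cost of an extra pass over the data.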

This situation arises whenever you train a model and want to use it to
classify new/unseen data. Therefore option 1 is clearly the best.

Thanks for any guidance on this,
Tom Sercu