Posted to user@mahout.apache.org by Aneesha <an...@gmail.com> on 2012/04/29 07:11:48 UTC

LDA input

I created a sequence file and built vectors from it for k-means. Is that the
same input we need to use for Latent Dirichlet Allocation?


Re: LDA input

Posted by ivan obeso <se...@gmail.com>.
Yes. You make a sequence file using, for example, the SequenceFile.Writer
class, writing the name of each file as the key and all of its content as
the value. You can write as many files as you want into the sequence file.

Then you use this *.seq as the input for DocumentProcessor.tokenizeDocuments
to tokenize it (you can use a stemmer here). The result is a folder with
files containing the tokens. That folder must be the input of the
DictionaryVectorizer.createTermFrequencyVectors method, which creates the
TF vectors of the corpus. Finally, the folder with the TF vectors is the
input of the LDA algorithm, which you can run with the "bin/mahout lda"
tool or call from a Java program.
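
To make the data flow concrete, here is a rough plain-Java sketch of the same three steps (file name -> content pairs, tokenizing, term-frequency vectors). It deliberately does not use any Mahout or Hadoop classes: TfPipelineSketch and its methods are made-up names for illustration only, the tokenizer is a simple regex split with no stemmer, and the in-memory maps stand in for the sequence file and the output folders. In a real job, SequenceFile.Writer, DocumentProcessor and DictionaryVectorizer do this work on HDFS.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics the shape of the pipeline, not the Mahout API.
final class TfPipelineSketch {

    // Tokenize step: lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Dictionary step: give every distinct term an integer id, in first-seen order.
    static Map<String, Integer> buildDictionary(Map<String, List<String>> tokenized) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> tokens : tokenized.values()) {
            for (String t : tokens) {
                dict.putIfAbsent(t, dict.size());
            }
        }
        return dict;
    }

    // TF step: sparse term-frequency vector for one document (term id -> count).
    static Map<Integer, Integer> termFrequencies(List<String> tokens, Map<String, Integer> dict) {
        Map<Integer, Integer> tf = new TreeMap<>();
        for (String t : tokens) {
            Integer id = dict.get(t);
            if (id != null) {
                tf.merge(id, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Stand-in for the sequence file: key = file name, value = whole file content.
        Map<String, String> corpus = new LinkedHashMap<>();
        corpus.put("doc1.txt", "Topic models find topics in text.");
        corpus.put("doc2.txt", "Clustering groups similar text.");

        Map<String, List<String>> tokenized = new LinkedHashMap<>();
        corpus.forEach((name, text) -> tokenized.put(name, tokenize(text)));

        Map<String, Integer> dict = buildDictionary(tokenized);
        tokenized.forEach((name, tokens) ->
                System.out.println(name + " -> " + termFrequencies(tokens, dict)));
    }
}
```

The per-document maps of term id to count are the kind of thing the TF vectors hold, and that is what LDA consumes. Note that there is no k-means step anywhere in this flow.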

Clustering is not necessary for running the LDA algorithm, because it
performs a clustering process itself.

[https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html]

On Sun, Apr 29, 2012 at 1:11 AM, Aneesha <an...@gmail.com> wrote:

> I create sequential file and create vector for k-means. Is it the same
> input we need to use for Latent Dirichlet Allocation????