Posted to user@mahout.apache.org by Benjamin Heilbrunn <be...@gmail.com> on 2011/07/26 13:27:19 UTC

Mahout LDA

Hi,

I'm new to Mahout and interested in using its LDA implementation.

I ran the supplied Reuters example and got the topics with their words
as a result.

Now I have 2 questions:

1) How can I display the topic distribution for an (existing) document
from the Reuters corpus?

2) How can I compute the topic distribution for a new and unknown document?


Thanks,
Benjamin

Re: Mahout LDA

Posted by Jake Mannix <ja...@gmail.com>.

On Tue, Jul 26, 2011 at 4:27 AM, Benjamin Heilbrunn <be...@gmail.com> wrote:
>
> 1) How can I display the topic distribution for an (existing) document
> from the Reuters corpus?
>

There is a sequence file called docTopics in the output directory. Keys are
docIds, values are VectorWritable. Use
"./bin/mahout vectordump -s <path to docTopics>" to print them out.

> 2) How can I compute the topic distribution for a new and unknown document?
>

This isn't hooked into the bin/mahout shell script, but it's an existing
Java method:

LDADriver.computeDocumentTopicProbabilities(Configuration conf,
                                            Path input,
                                            Path stateIn,
                                            Path outputPath,
                                            int numTopics,
                                            int numWords,
                                            double topicSmoothing)

The input path should be a SequenceFile whose values are VectorWritable
document instances, and stateIn should be the path to the topic model from
the final LDA iteration. Make sure you used the same dictionary to create
both the input and the topic model, or else you'll get nonsense.
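
To illustrate the dictionary point, here is a minimal plain-Java sketch (not
Mahout code). It assumes the dictionary is a simple term-to-index map and that
documents are dense term-count vectors of length numWords. The class name
DictionaryVectorizer and the vectorize helper are hypothetical, made up for
this example. The idea is that a new document has to be mapped through the
same term indices the model was trained with, and terms the dictionary does
not know have no index and are dropped.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class DictionaryVectorizer {

    // Hypothetical helper: counts how often each dictionary term occurs in
    // the document. The dictionary maps term -> index and must be the one
    // used when the topic model was trained; numWords is its size.
    static double[] vectorize(String[] tokens, Map<String, Integer> dictionary, int numWords) {
        double[] counts = new double[numWords];
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) {
                counts[index] += 1.0;
            }
            // Terms missing from the dictionary are skipped: the model has
            // no word index for them.
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy dictionary standing in for the one built from the training corpus.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("oil", 0);
        dictionary.put("price", 1);
        dictionary.put("trade", 2);

        // "tariff" is not in the dictionary, so it does not contribute.
        String[] newDoc = {"oil", "price", "oil", "tariff"};
        double[] vector = vectorize(newDoc, dictionary, 3);
        System.out.println(Arrays.toString(vector)); // prints [2.0, 1.0, 0.0]
    }
}
```

If the new document were vectorized with a different dictionary, index 0 could
mean some other word entirely, and the topic probabilities computed from it
would be meaningless.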

  -jake