Posted to commits@mahout.apache.org by co...@apache.org on 2009/06/30 04:35:00 UTC

[CONF] Apache Lucene Mahout: Latent Dirichlet Allocation (page created)

Latent Dirichlet Allocation (MAHOUT) created by David Hall
http://cwiki.apache.org/confluence/display/MAHOUT/Latent+Dirichlet+Allocation

Content:
---------------------------------------------------------------------

h1. Overview

Latent Dirichlet Allocation (Blei et al., 2003) is a powerful learning algorithm for automatically and jointly clustering words into "topics" and documents into mixtures of topics. It has been successfully applied to model change in scientific fields over time (Griffiths and Steyvers, 2004; Hall et al., 2008).

A topic model is, roughly, a hierarchical Bayesian model that associates with each document a probability distribution over "topics", which are in turn distributions over words. For instance, a topic in a collection of newswire might include words about "sports", such as "baseball", "home run", and "player", and a document about steroid use in baseball might include "sports", "drugs", and "politics". Note that the labels "sports", "drugs", and "politics" are post-hoc labels assigned by a human, and that the algorithm itself only associates words with probabilities. The task of parameter estimation in these models is to learn both what these topics are and which documents employ them in what proportions.

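As a rough illustration of the description above, the probability of a word in a document is the topic-weighted average of that word's probability under each topic. The topics, the document mixture, and all numbers below are made up for the example; they are not output of Mahout.

```python
# Hypothetical toy topics: each topic is a probability distribution over words.
topics = {
    "topic_0": {"baseball": 0.50, "player": 0.30, "steroid": 0.10, "senate": 0.10},
    "topic_1": {"baseball": 0.05, "player": 0.05, "steroid": 0.80, "senate": 0.10},
    "topic_2": {"baseball": 0.05, "player": 0.15, "steroid": 0.10, "senate": 0.70},
}

# Hypothetical document: a probability distribution over the topics.
doc_mixture = {"topic_0": 0.5, "topic_1": 0.3, "topic_2": 0.2}


def word_probability(word, mixture, topics):
    """p(word | document) = sum over topics of p(topic | document) * p(word | topic)."""
    return sum(weight * topics[name].get(word, 0.0) for name, weight in mixture.items())


# The implied word distribution for this document.
doc_word_dist = {w: word_probability(w, doc_mixture, topics) for w in topics["topic_0"]}
```

A human looking at the high-probability words might label these topics "sports", "drugs", and "politics", but the model itself only sees the numbers.
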
Another way to view a topic model is as a generalization of a mixture model, like [Dirichlet Process Clustering]. Starting from a standard mixture model, in which we have a single global mixture of several distributions, we instead say that _each_ document has its own mixture distribution over the globally shared mixture components. Operationally, in Dirichlet Process Clustering, each document has its own latent variable drawn from a global mixture that specifies which component it belongs to, while in LDA, each word in each document has its own latent variable drawn from a document-wide mixture.

The idea behind a mixture model is to explain some observed data with a probabilistic mixture of several models. Each observed data point is assumed to have come from one of the models in the mixture, but we don't know which one. We handle this with a so-called latent variable that specifies which model each data point came from.
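
The contrast between the two models can be sketched as two generative procedures. This is a minimal sketch under assumptions: the function names, the symmetric Dirichlet parameter {{alpha}}, and the toy components are hypothetical, and a finite mixture stands in for the mixture model.

```python
import random


def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) over k outcomes via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mixture_model_document(global_mixture, components, length, rng):
    """Plain mixture model: ONE latent component is drawn for the whole document."""
    z = sample_categorical(global_mixture, rng)
    words = [sample_categorical(components[z], rng) for _ in range(length)]
    return [z] * length, words


def lda_document(alpha, components, length, rng):
    """LDA: a per-document mixture is drawn first, then one latent topic PER WORD."""
    theta = sample_dirichlet(alpha, len(components), rng)
    assignments = [sample_categorical(theta, rng) for _ in range(length)]
    words = [sample_categorical(components[z], rng) for z in assignments]
    return assignments, words


rng = random.Random(0)
# Two hypothetical components over a three-word vocabulary (word ids 0, 1, 2).
components = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
mm_assignments, mm_words = mixture_model_document([0.5, 0.5], components, 8, rng)
lda_assignments, lda_words = lda_document(0.5, components, 8, rng)
```

In the mixture model every word of a document shares the same assignment, while in the LDA sketch the assignments may differ from word to word.
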

h1. Invocation and Usage

Mahout's implementation of LDA operates on a collection of SparseVectors of word counts. These word counts should be non-negative integers, though things will probably work fine if you use non-negative reals. (Note that the probabilistic model doesn't make sense if you do!) To create these vectors, it's recommended that you follow the instructions in [Creating Vectors From Text], making sure to use TF and not TFIDF as the scorer.
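
To make the expected input concrete, here is a small sketch of turning tokenized documents into sparse term-frequency (TF) vectors. It only illustrates the shape of the data; it is not Mahout's vector-creation code, and the helper names are hypothetical.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def term_frequency_vector(tokens, vocab):
    """Sparse vector {word_id: raw count}; raw counts (TF), not TF-IDF weights."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))


docs = [["baseball", "player", "baseball"], ["steroid", "player"]]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(tokens, vocab) for tokens in docs]
```

Each value is a non-negative integer count, which is what the probabilistic model assumes.
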

Invocation takes the form:

{{mvn exec:java -Dexec.mainClass=org.apache.mahout.clustering.lda.LDADriver -Dexec.args="<input vectors> <working directory> <number of topics> <number of words in the vocabulary> <topic smoothing> <max iterations> <num reducers>"}}

Topic smoothing should generally be about 50/K, where K is the number of topics. The number of words in the vocabulary can be an upper bound, though it shouldn't be too high, because memory use grows with it.
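
As a small worked example of the 50/K rule of thumb, the sketch below assembles the positional argument string in the order shown in the invocation above. The helper name, the paths, and all values are hypothetical placeholders.

```python
def lda_driver_args(input_vectors, working_dir, num_topics, vocab_size,
                    max_iterations, num_reducers):
    """Build the positional argument string, setting topic smoothing to 50/K."""
    topic_smoothing = 50.0 / num_topics
    parts = [input_vectors, working_dir, num_topics, vocab_size,
             topic_smoothing, max_iterations, num_reducers]
    return " ".join(str(p) for p in parts)


# Hypothetical paths and sizes: 20 topics gives a topic smoothing of 2.5.
args = lda_driver_args("vectors/", "lda-work/", 20, 50000, 40, 2)
```

With these placeholder values, {{args}} is {{vectors/ lda-work/ 20 50000 2.5 40 2}}.
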

Choosing the number of topics is more art than science, and it's recommended that you try several values.

h1. Parameter Estimation

We use mean field variational inference to estimate the models. Variational inference can be thought of as a generalization of EM for hierarchical Bayesian models. In the E-Step, we infer, for each document, the posterior probability of each topic for each word in that document. We then take the sufficient statistics and emit them in the form of (log) pseudo-counts for each word in each topic. The M-Step simply sums these together and (log) normalizes them, so that we have a distribution over the entire vocabulary of the corpus for each topic.
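
The two steps can be sketched on a toy corpus as follows. This is a minimal sketch of the standard mean field updates for LDA, not Mahout's code: it works in linear rather than log space for readability, {{alpha}} is assumed to be a symmetric Dirichlet prior on each document's topic mixture, and all names and numbers are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def e_step(doc_counts, topics, alpha, iterations=20):
    """Infer phi[word][k], the posterior probability of topic k for each word.

    doc_counts: {word_id: count}; topics: K lists holding p(word | topic).
    """
    k = len(topics)
    gamma = [alpha + sum(doc_counts.values()) / k] * k
    phi = {}
    for _ in range(iterations):
        weights = [math.exp(digamma(g)) for g in gamma]
        new_gamma = [alpha] * k
        for word, count in doc_counts.items():
            scores = [topics[t][word] * weights[t] for t in range(k)]
            norm = sum(scores)
            phi[word] = [s / norm for s in scores]
            for t in range(k):
                new_gamma[t] += count * phi[word][t]
        gamma = new_gamma
    return gamma, phi


def m_step(corpus, phis, num_topics, vocab_size):
    """Sum pseudo-counts per (topic, word), then normalize each topic."""
    counts = [[0.0] * vocab_size for _ in range(num_topics)]
    for doc_counts, phi in zip(corpus, phis):
        for word, count in doc_counts.items():
            for t in range(num_topics):
                counts[t][word] += count * phi[word][t]
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total for c in row])
    return normalized


# Hypothetical corpus of two documents over a three-word vocabulary.
corpus = [{0: 3, 1: 1}, {1: 1, 2: 4}]
topics = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
alpha = 0.5
results = [e_step(doc, topics, alpha) for doc in corpus]
new_topics = m_step(corpus, [phi for _, phi in results], len(topics), 3)
```

Repeating the last two lines with {{new_topics}} in place of {{topics}} gives one further iteration.
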

In the implementation, the E-Step runs in the map step and the M-Step runs in the reduce step, with the final normalization happening as a post-processing step.
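
The division of labor can be sketched in plain Python, with generators and dictionaries standing in for Hadoop. Everything here is an assumption made for illustration: the function names, the key layout {{(topic, word)}}, and the hand-written topic posteriors that stand in for real E-Step output.

```python
import math
from collections import defaultdict


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def map_document(doc_counts, phi):
    """Map: emit ((topic, word), log pseudo-count) for one document.

    phi[word][topic] is the E-Step posterior of each topic for that word.
    """
    for word, count in doc_counts.items():
        for topic, prob in enumerate(phi[word]):
            if prob > 0.0:
                yield (topic, word), math.log(count * prob)


def reduce_counts(emitted):
    """Reduce: add up pseudo-counts per key, staying in log space."""
    grouped = defaultdict(list)
    for key, log_value in emitted:
        grouped[key].append(log_value)
    return {key: log_sum_exp(values) for key, values in grouped.items()}


def log_normalize(reduced):
    """Post-processing: turn each topic's log counts into log probabilities."""
    per_topic = defaultdict(list)
    for (topic, _), log_value in reduced.items():
        per_topic[topic].append(log_value)
    totals = {topic: log_sum_exp(values) for topic, values in per_topic.items()}
    return {(t, w): v - totals[t] for (t, w), v in reduced.items()}


# Hypothetical documents paired with hand-written topic posteriors.
docs = [
    ({0: 2, 1: 1}, {0: [0.9, 0.1], 1: [0.5, 0.5]}),
    ({1: 1, 2: 3}, {1: [0.2, 0.8], 2: [0.1, 0.9]}),
]
emitted = [pair for counts, phi in docs for pair in map_document(counts, phi)]
reduced = reduce_counts(emitted)
log_topics = log_normalize(reduced)
```

Summing in log space with {{log_sum_exp}} avoids underflow when pseudo-counts are very small.
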

h1. References

[David M. Blei, Andrew Y. Ng, Michael I. Jordan. 2003. Latent Dirichlet Allocation. JMLR.|http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf]
