Posted to dev@mahout.apache.org by "David Hall (JIRA)" <ji...@apache.org> on 2009/06/16 22:51:07 UTC

[jira] Updated: (MAHOUT-123) Implement Latent Dirichlet Allocation

     [ https://issues.apache.org/jira/browse/MAHOUT-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-123:
------------------------------

        Fix Version/s: 0.2
    Affects Version/s: 0.2
               Status: Patch Available  (was: Open)

This is a rough-cut implementation, not ready to go yet. I've been waiting on MAHOUT-126 because it seems like the right way to create the Vectors I need, though perhaps there's a better way.

Basic approach follows the Dirichlet implementation. There is a driver class (LDADriver) which runs K MapReduce jobs, plus a Mapper and a Reducer. We also have an Inferencer, which the Mapper uses to compute expected sufficient statistics. A document is just a V-dimensional sparse vector of word counts.
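
For concreteness, here is a minimal sketch of what I mean by the document representation, written against nothing but the JDK since the Vector API is still settling (MAHOUT-126); the class and method names are hypothetical, not the ones in the patch:

{code}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a sparse document vector: word id -> count,
// with a fixed cardinality V (the vocabulary size).
public class WordCountDocument {
  private final int numWords;
  private final Map<Integer, Integer> counts = new HashMap<Integer, Integer>();

  public WordCountDocument(int numWords) {
    this.numWords = numWords;
  }

  // Record one occurrence of the given word id.
  public void addOccurrence(int wordId) {
    Integer c = counts.get(wordId);
    counts.put(wordId, c == null ? 1 : c + 1);
  }

  // Count of the given word in this document (0 if absent).
  public int count(int wordId) {
    Integer c = counts.get(wordId);
    return c == null ? 0 : c;
  }

  public int cardinality() {
    return numWords;
  }
}
{code}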

Map: perform inference on each document (~ E-step) and output log probabilities of p(word|topic)
Reduce: logSum the input log probabilities (~ M-step) and output the result; a sketch of logSum follows below.

Loop: use the results of the reduce as the log probabilities for the map.
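
The logSum in the reduce is just the usual numerically stable log-sum-exp; a minimal, self-contained sketch (nothing Mahout-specific):

{code}
// Numerically stable log-sum-exp: given log values, return the log of
// the sum of the underlying values without underflowing.
public final class LogSum {
  private LogSum() {
  }

  public static double logSum(double logA, double logB) {
    if (logA == Double.NEGATIVE_INFINITY) {
      return logB;                       // log(0) contributes nothing
    }
    if (logB == Double.NEGATIVE_INFINITY) {
      return logA;
    }
    double max = Math.max(logA, logB);
    return max + Math.log(Math.exp(logA - max) + Math.exp(logB - max));
  }

  public static double logSum(double[] logValues) {
    double result = Double.NEGATIVE_INFINITY;
    for (double v : logValues) {
      result = logSum(result, v);
    }
    return result;
  }
}
{code}

In the Reducer, folding logSum over every incoming value for a given (topic, word) key yields the unnormalized log of the new p(word|topic).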

Remaining:
1) Actually run the thing
2) Number-of-non-zero elements in a sparse vector. Is that staying "size"?
3) Allow computing the likelihood, to determine when we're done (a sketch of a possible stopping test follows this list).
4) What's the status of serializing as sparse vector and reading as a dense vector? Is that going to happen?
5) Find a fun data set to bundle...
6) Convenience method for running just inference on a set of documents and outputting MAP estimates of word probabilities.
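
For item 3, the kind of stopping test I have in mind is just a relative-change check on the corpus log likelihood; a minimal sketch (the tolerance and method name are placeholders):

{code}
// Stop iterating once the relative change in corpus log likelihood
// between successive passes drops below a small tolerance.
public static boolean hasConverged(double previousLogLikelihood,
                                   double currentLogLikelihood,
                                   double relativeTolerance) {
  if (Double.isInfinite(previousLogLikelihood)) {
    return false;                        // first pass: no baseline yet
  }
  double delta = Math.abs(currentLogLikelihood - previousLogLikelihood);
  return delta / Math.abs(previousLogLikelihood) < relativeTolerance;
}
{code}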

> Implement Latent Dirichlet Allocation
> -------------------------------------
>
>                 Key: MAHOUT-123
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-123
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>    Affects Versions: 0.2
>            Reporter: David Hall
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: lda.patch
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> (For GSoC)
> Abstract:
> Latent Dirichlet Allocation (Blei et al, 2003) is a powerful learning
> algorithm for automatically and jointly clustering words into "topics"
> and documents into mixtures of topics, and it has been successfully
> applied to model change in scientific fields over time (Griffiths and
> Steyvers, 2004; Hall, et al. 2008). In this project, I propose to
> implement a distributed variant of Latent Dirichlet Allocation using
> MapReduce, and, time permitting, to investigate extensions of LDA and
> possibly more efficient algorithms for distributed inference.
> Detailed Description:
> A topic model is, roughly, a hierarchical Bayesian model that
> associates with each document a probability distribution over
> "topics", which are in turn distributions over words. For instance, a
> topic in a collection of newswire might include words about "sports",
> such as "baseball", "home run", "player", and a document about steroid
> use in baseball might include "sports", "drugs", and "politics". Note
> that the labels "sports", "drugs", and "politics", are post-hoc labels
> assigned by a human, and that the algorithm itself only associates
> words with probabilities. The task of parameter estimation
> in these models is to learn both what these topics are, and which
> documents employ them in what proportions.
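> To make the model concrete, the standard LDA generative story for a
> document d, in the notation of Blei et al. (2003), is:
>
>   \theta_d \sim \mathrm{Dirichlet}(\alpha)               % topic mixture for document d
>   z_{dn} \mid \theta_d \sim \mathrm{Mult}(\theta_d)      % topic of the n-th word
>   w_{dn} \mid z_{dn} \sim \mathrm{Mult}(\beta_{z_{dn}})  % word drawn from that topic
>
> where \beta_k is topic k's distribution over the vocabulary; learning
> amounts to estimating the \beta_k and the per-document \theta_d.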
> One of the promises of unsupervised learning algorithms like Latent
> Dirichlet Allocation (LDA; Blei et al, 2003) is the ability to take
> massive collections of documents and condense them down into a
> collection of easily understandable topics. However, all available
> open source implementations of LDA and related topic models are not
> distributed, which hampers their utility. This project seeks to
> correct this shortcoming.
> In the literature, there have been several proposals for parallelizing
> LDA. Newman, et al (2007) proposed to create an "approximate" LDA in
> which each processor gets its own subset of the documents to run
> Gibbs sampling over. However, Gibbs sampling is slow and stochastic by
> its very nature, which is not advantageous for repeated runs. Instead,
> I propose to follow Nallapati, et al. (2007) and use a variational
> approximation that is fast and non-random.
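> To sketch why the variational approach is deterministic: for a single
> document, the updates of Blei et al. (2003) are plain coordinate
> ascent on the variational parameters, e.g.
>
>   \phi_{nk} \propto \beta_{k,w_n} \exp(\Psi(\gamma_k))   % per-word topic responsibilities
>   \gamma_k = \alpha_k + \sum_n \phi_{nk}                 % per-document Dirichlet parameters
>
> where \Psi is the digamma function. Iterating these to convergence
> yields the expected sufficient statistics with no sampling involved,
> so repeated runs give the same answer.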
> References:
> David M. Blei, J. McAuliffe. Supervised Topic Models. NIPS, 2007.
> David M. Blei, Andrew Y. Ng, Michael I. Jordan. Latent Dirichlet
> Allocation. The Journal of Machine Learning Research, 3, p. 993-1022,
> 2003.
> T. L. Griffiths and M. Steyvers. Finding scientific topics. Proc Natl
> Acad Sci U S A, 101 Suppl 1: 5228-5235, April 2004.
> David LW Hall, Daniel Jurafsky, and Christopher D. Manning. Studying
> the History of Ideas Using Topic Models. EMNLP, Honolulu, 2008.
> Ramesh Nallapati, William Cohen, John Lafferty, Parallelized
> variational EM for Latent Dirichlet Allocation: An experimental
> evaluation of speed and scalability, ICDM workshop on high performance
> data mining, 2007.
> Newman, D., Asuncion, A., Smyth, P., & Welling, M. Distributed
> Inference for Latent Dirichlet Allocation. NIPS, 2007.
> Xuerui Wang, Andrew McCallum. Topics over time: a non-Markov
> continuous-time model of topical trends. KDD, 2006.
> Wolfe, J., Haghighi, A, and Klein, D. Fully distributed EM for very
> large datasets. ICML, 2008.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.