Posted to dev@mahout.apache.org by Daniel Kluesing <da...@ilike-inc.com> on 2008/05/27 02:49:08 UTC

LDA [was RE: Taste on Mahout]

(Hijacking the thread to discuss ways to implement LDA) 

Had you seen http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf ?

Their hierarchical distributed LDA formulation uses Gibbs sampling and
fits into MapReduce.

http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a
MapReduce formulation for the variational EM method.

I'm still chewing on them, but my first impression is that the EM
approach would give better performance on bigger data sets. Opposing
views welcome.
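
For concreteness, a minimal single-machine sketch of the collapsed Gibbs
sampling update that the NIPS paper's distributed formulation builds on
(plain Java; the field names and layout are illustrative, not taken from
either paper). Roughly, each node would run sweeps like this over its own
share of the documents, with the topic-word counts merged between
iterations:

    import java.util.Random;

    /** One collapsed Gibbs sampling sweep for LDA (illustrative sketch). */
    public class LdaGibbsSweep {

      int numTopics, vocabSize;
      double alpha, beta;        // symmetric Dirichlet hyperparameters
      int[][] docs;              // docs[d][i] = word id of token i in doc d
      int[][] z;                 // z[d][i]    = current topic of that token
      int[][] docTopicCount;     // docTopicCount[d][k]
      int[][] topicWordCount;    // topicWordCount[k][w]
      int[] topicCount;          // topicCount[k] = total tokens in topic k
      Random rng = new Random();

      void sweep() {
        double[] p = new double[numTopics];
        for (int d = 0; d < docs.length; d++) {
          for (int i = 0; i < docs[d].length; i++) {
            int w = docs[d][i];
            int old = z[d][i];
            // remove this token's contribution to the counts
            docTopicCount[d][old]--; topicWordCount[old][w]--; topicCount[old]--;
            // p(k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
            double sum = 0.0;
            for (int k = 0; k < numTopics; k++) {
              p[k] = (docTopicCount[d][k] + alpha)
                   * (topicWordCount[k][w] + beta) / (topicCount[k] + vocabSize * beta);
              sum += p[k];
            }
            // draw the new topic from the discrete distribution p
            double u = rng.nextDouble() * sum;
            int k = 0;
            while (u > p[k] && k < numTopics - 1) { u -= p[k]; k++; }
            // put the token back with its new topic
            z[d][i] = k;
            docTopicCount[d][k]++; topicWordCount[k][w]++; topicCount[k]++;
          }
        }
      }
    }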

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Sunday, May 25, 2008 12:11 AM
To: mahout-dev@lucene.apache.org
Subject: Re: Taste on Mahout

Buntine and Jakulin provide an even farther-ranging common structure
between LDA, pLSI, LSI and non-negative matrix factorizations in
http://citeseer.ist.psu.edu/750239.html

They also provide a much simpler algorithm for estimating parameters
that is (to my mind) simpler to implement in map-reduce than the
variational optimization of Jordan et al.

I am very interested in helping with a good LDA map-reduce
implementation.
My time constraints limit how much actual code I can generate for the
implementation, but I would still like to help in whatever small way I
might be able to.

On Sat, May 24, 2008 at 2:27 PM, Daniel Kluesing <da...@ilike-inc.com>
wrote:

> LDA is a proper 'generative model', while pLSI 'fakes' being a
> generative model. From a generative model you have the full
> probability distribution of all variables; this matters when you're
> working with new unseen data.
>
> You may find http://www.cs.bham.ac.uk/~axk/sigir2003_mgak.pdf a good
> comparison; it shows that pLSI is a special case of LDA.
>
> If anybody is working on/interested in working on a MapReduce LDA
> implementation for Mahout, I'd love to chat with you.
>
>
> -----Original Message-----
> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> Sent: Thursday, May 22, 2008 5:31 AM
> To: mahout-dev@lucene.apache.org
> Subject: RE: Taste on Mahout
>
> Hey Ted,
>        I read the paper on LDA
> (http://citeseer.ist.psu.edu/blei03latent.html) and I have to admit I 
> could not understand how LDA would be any different from pLSI for the
> problem setting that I have (user-click history for various users and
> urls). Maybe it's my limited statistical knowledge and ML background,
> but I am making my best effort to learn things as they come along.
>
> I found the notation to be quite complex and it would be nice if you
> could point me to a source offering a simpler explanation of the LDA
> model parameters and their estimation methods, as after reading the
> paper I could not map those methods onto my problem setting.
>
> Since I already have some understanding of pLSI and Expectation
> Maximization, an explanation describing the role of the additional
> model parameters and their estimation method would suffice. Maybe
> that's something you could help me with offline.
>
> Thanks
> -Ankur
>
>
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, May 21, 2008 10:24 PM
> To: mahout-dev@lucene.apache.org
> Subject: Re: Taste on Mahout
>
> My suggestion is to build a class of probabilistic models of what 
> people click on.  You can build some number of models as necessary to 
> describe your users' histories well.
>
> These models will give you the answers you need.
>
> I can talk this evening a bit about how to do this.  If you want to 
> read up on it ahead of time, take a look at 
> http://citeseer.ist.psu.edu/750239.html and
> http://citeseer.ist.psu.edu/blei03latent.html
>
> (hint: consider each person a document and a thing to be clicked as a
> word)
>
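
That mapping is nothing more than a term-count vector per user; a small
illustrative sketch (hypothetical types, plain Java, not the Taste/Mahout
API):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    /** Treat each user as a "document" whose "words" are the item ids clicked. */
    public class ClicksAsDocuments {

      /** clickLog maps a user id to the list of item ids that user clicked. */
      static Map<String, Map<String, Integer>> toDocuments(Map<String, List<String>> clickLog) {
        Map<String, Map<String, Integer>> docs = new HashMap<String, Map<String, Integer>>();
        for (Map.Entry<String, List<String>> e : clickLog.entrySet()) {
          Map<String, Integer> termCounts = new HashMap<String, Integer>();
          for (String item : e.getValue()) {
            Integer c = termCounts.get(item);
            termCounts.put(item, c == null ? 1 : c + 1); // repeated clicks = repeated "words"
          }
          docs.put(e.getKey(), termCounts);
        }
        return docs;
      }
    }
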
> On Wed, May 21, 2008 at 4:36 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > Hey Sean,
> >          Thanks for the suggestions. In my case the data-set is only
> > going to tell me if the user clicked on a particular item. So let's
> > say there are 10,000 items and a user might only have clicked 20 - 30
> > items. I was thinking more along the lines of building an item
> > similarity table by comparing each item with every other item and
> > retaining only the top 100 items, decayed by time.
> >
> > So a recommender for a user would use his recent browsing history to
> > figure out the top 10 or 20 most similar items.
> >
> > The approach is documented in Toby Segaran's "Collective Intelligence"
> > book and looks simple to implement even though it is costly, since
> > every item needs to be compared with every other item. This can be
> > parallelized in such a way that for M items in a cluster of N machines,
> > each node has to compare M/N items to all M items. Since the data-set
> > is going to be sparse (no. of items having common users), I believe
> > this wouldn't be overwhelming for the cluster.
> >
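
A rough sketch of that item-item table (plain Java, illustrative names;
each of the N nodes would run this over its M/N share of the items, and
the time decay is omitted):

    import java.util.*;

    /** Item-item similarity via the Tanimoto coefficient over the sets of
        users who clicked each item, keeping the top 100 neighbours per item. */
    public class ItemSimilarityTable {

      static Map<String, List<Map.Entry<String, Double>>> build(Map<String, Set<String>> usersByItem) {
        Map<String, List<Map.Entry<String, Double>>> table =
            new HashMap<String, List<Map.Entry<String, Double>>>();
        for (String a : usersByItem.keySet()) {
          Set<String> usersA = usersByItem.get(a);
          Map<String, Double> sims = new HashMap<String, Double>();
          for (String b : usersByItem.keySet()) {
            if (a.equals(b)) continue;
            Set<String> usersB = usersByItem.get(b);
            int common = 0;
            for (String u : usersA) if (usersB.contains(u)) common++;
            if (common == 0) continue;                 // sparse: most pairs drop out here
            sims.put(b, (double) common / (usersA.size() + usersB.size() - common));
          }
          // keep only the 100 most similar items
          List<Map.Entry<String, Double>> top =
              new ArrayList<Map.Entry<String, Double>>(sims.entrySet());
          Collections.sort(top, new Comparator<Map.Entry<String, Double>>() {
            public int compare(Map.Entry<String, Double> x, Map.Entry<String, Double> y) {
              return Double.compare(y.getValue(), x.getValue());
            }
          });
          table.put(a, top.size() > 100 ? top.subList(0, 100) : top);
        }
        return table;
      }
    }
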
> > The other approach that I am considering to reduce the computation
> > cost is to use a clustering algorithm like K-Means that's available in
> > Mahout to cluster similar users/items together and then use the
> > clustering information to make recommendations.
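
As an illustration of the clustering idea: once users have been assigned
to clusters (for example by K-Means over their click vectors), a
recommender could simply surface the items most popular within the user's
cluster. A hypothetical sketch, assuming the assignments already exist:

    import java.util.*;

    /** Recommend the items clicked most often by other users in the same
        cluster (all names are illustrative). */
    public class ClusterRecommender {

      static List<String> recommend(String user,
                                    Map<String, Integer> clusterOf,         // user -> cluster id
                                    Map<String, List<String>> clicksByUser, // user -> clicked items
                                    int howMany) {
        int cluster = clusterOf.get(user);
        Set<String> alreadySeen = new HashSet<String>(clicksByUser.get(user));
        final Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map.Entry<String, List<String>> e : clicksByUser.entrySet()) {
          if (e.getKey().equals(user) || clusterOf.get(e.getKey()) != cluster) continue;
          for (String item : e.getValue()) {
            if (alreadySeen.contains(item)) continue;   // don't re-recommend seen items
            Integer c = counts.get(item);
            counts.put(item, c == null ? 1 : c + 1);
          }
        }
        List<String> items = new ArrayList<String>(counts.keySet());
        Collections.sort(items, new Comparator<String>() {
          public int compare(String x, String y) { return counts.get(y) - counts.get(x); }
        });
        return items.size() > howMany ? items.subList(0, howMany) : items;
      }
    }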
> >
> > Any suggestions?
> >
> > Thanks
> > -Ankur
> >
> >
> > -----Original Message-----
> > From: Sean Owen [mailto:srowen@gmail.com]
> > Sent: Tuesday, May 20, 2008 9:37 PM
> > To: mahout-dev@lucene.apache.org; Goel, Ankur
> > Subject: Re: Taste on Mahout
> >
> > + Ankur directly, since I am not sure you are on the dev list.
> >
> > On Tue, May 20, 2008 at 12:06 PM, Sean Owen <sr...@gmail.com> wrote:
> > > All of the algorithms assume a world where you have a continuous
> > > range of ratings from users for items. Obviously a binary yes/no
> > > rating can be mapped into that trivially -- 1 and -1 for example.
> > > This causes some issues, most notably for correlation-based
> > > recommenders where the correlation can be undefined between two
> > > items/users in special cases that arise from this kind of input --
> > > for example if we overlap in rating 3 items and I voted "yes" for
> > > all 3, then no correlation can be defined.
> > >
> > > Slope one doesn't run into this particular mathematical wrinkle.
> > >
> > > Also, methods like estimatePreference() are not going to give you 
> > > estimates that are always 1 or -1. Again, you could map this back 
> > > onto
> > > 1 / -1 by rounding or something, just something to note.
> > >
> > > So, in general it will be better if you can map whatever input you
> > > have onto a larger range of input. You will also feed in more
> > > information this way. For example, maybe you call a recent "yes"
> > > rating a +2, and a recent "no" a -2, and others +1 and -1.
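
That mapping can be as small as the following (illustrative only):

    /** Map binary click feedback onto a wider preference range, as above. */
    public class ClickToPreference {
      static double preference(boolean clicked, boolean recent) {
        if (clicked) { return recent ? 2.0 : 1.0; }  // recent "yes" counts more
        return recent ? -2.0 : -1.0;                 // recent "no" counts more (negatively)
      }
    }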
> > >
> > >
> > > The part of slope one that parallelizes very well is the computing
> > > of the item-item diffs. No, I have not written this yet.
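
A minimal sequential sketch of those diffs -- the average rating
difference per co-rated item pair -- which is the piece a map/reduce job
would distribute (names are illustrative, not Taste's API):

    import java.util.*;

    /** Average rating difference for every pair of items co-rated by some user. */
    public class SlopeOneDiffs {

      static Map<String, Double> computeDiffs(Map<String, Map<String, Double>> ratingsByUser) {
        Map<String, Double> sums = new HashMap<String, Double>();
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map<String, Double> prefs : ratingsByUser.values()) {
          for (Map.Entry<String, Double> a : prefs.entrySet()) {
            for (Map.Entry<String, Double> b : prefs.entrySet()) {
              if (a.getKey().compareTo(b.getKey()) >= 0) continue; // each unordered pair once
              String pair = a.getKey() + ':' + b.getKey();
              Double s = sums.get(pair);
              Integer c = counts.get(pair);
              sums.put(pair, (s == null ? 0.0 : s) + (a.getValue() - b.getValue()));
              counts.put(pair, (c == null ? 0 : c) + 1);
            }
          }
        }
        Map<String, Double> diffs = new HashMap<String, Double>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
          diffs.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return diffs;
      }
    }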
> > >
> > >
> > > I have committed a first cut at a framework for computing
> > > recommendations in parallel for any recommender. Dig in to
> > > org.apache.mahout.cf.taste.impl.hadoop. In general, none of the
> > > existing recommenders can be parallelized, because they generally
> > > need access to all the data to produce any recommendation.
> > >
> > > But, we can take partial advantage of Hadoop by simply
> > > parallelizing the computation of recommendations for many users
> > > across multiple identical recommender instances. Better than
> > > nothing. In this situation, one of the map or reduce phases is
> > > trivial.
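
The "partial advantage" described above is plain data parallelism over
users. A hedged sketch of its shape, using a stand-in Recommender
interface rather than the actual Taste class:

    import java.util.*;
    import java.util.concurrent.*;

    /** Split the users into n shards and let n identical recommender copies
        work through them independently -- the same shape a map-only Hadoop
        job would have. */
    public class ShardedRecommendation {

      interface Recommender {                      // stand-in for the real thing
        List<String> recommend(String userId, int howMany);
      }

      static Map<String, List<String>> recommendAll(final List<Recommender> replicas,
                                                    List<String> userIds,
                                                    final int howMany) throws Exception {
        final List<String> users = userIds;
        final Map<String, List<String>> results = new ConcurrentHashMap<String, List<String>>();
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        List<Future<?>> futures = new ArrayList<Future<?>>();
        for (int shard = 0; shard < replicas.size(); shard++) {
          final int s = shard;
          futures.add(pool.submit(new Runnable() {
            public void run() {
              for (int i = s; i < users.size(); i += replicas.size()) { // every n-th user
                String user = users.get(i);
                results.put(user, replicas.get(s).recommend(user, howMany));
              }
            }
          }));
        }
        for (Future<?> f : futures) f.get();       // wait for all shards to finish
        pool.shutdown();
        return results;
      }
    }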
> > >
> > > That is what I have committed so far and it works, locally. I am 
> > > in the middle of figuring out how to write it for real use on a 
> > > remote Hadoop cluster, and how I would go about testing that!
> > >
> > > Do we have any test bed available?
> > >
> > >
> > >
> > > On Tue, May 20, 2008 at 7:47 AM, Goel, Ankur
> > > <An...@corp.aol.com> wrote:
> > >> I just realized after going through Wikipedia that slope one is
> > >> applicable when you have ratings for the items.
> > >> In my case, I would be simply working with binary data (Item was
> > >> clicked or not-clicked by user) using the Tanimoto coefficient to
> > >> calculate item similarity.
> > >> The idea is to capture the simple intuition "What items have been
> > >> visited most along with this item".
> > >>
> > >>
> > >> -----Original Message-----
> > >> From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > >> Sent: Tuesday, May 20, 2008 2:51 PM
> > >> To: mahout-dev@lucene.apache.org
> > >> Subject: RE: Taste on Mahout
> > >>
> > >>
> > >> Hey Sean,
> > >>       I actually plan to use slope-one to start with since
> > >> - It's simple and known to work well.
> > >> - It can be parallelized nicely in the Map-Reduce style.
> > >> I also plan to use the Tanimoto coefficient for item-item diffs.
> > >>
> > >> Do we have something on slope-one already in Taste as a part of
> > >> Mahout?
> > >>
> > >> At the moment I am going through the available documentation on
> > >> Taste and the code that's present in Mahout.
> > >>
> > >> Your suggestions would be greatly appreciated.
> > >>
> > >> Thanks
> > >> -Ankur
> > >>
> > >> -----Original Message-----
> > >> From: Sean Owen [mailto:srowen@gmail.com]
> > >> Sent: Tuesday, April 29, 2008 11:09 PM
> > >> To: mahout-dev@lucene.apache.org; Goel, Ankur
> > >> Subject: Re: Taste on Mahout
> > >>
> > >> I have some Hadoop code mostly ready to go for Taste.
> > >>
> > >> The first thing to do is let you generate recommendations for all
> > >> your users via Hadoop. Unfortunately none of the recommenders
> > >> truly parallelize in the way that MapReduce needs them to -- you
> > >> need all the data to compute any recommendation really -- but you
> > >> can at least get parallelization out of this. You can use the
> > >> framework to run n recommenders, each computing 1/nth of all
> > >> recommendations.
> > >>
> > >> The next application is specific to slope-one. Computing the
> > >> item-item diffs is exactly the kind of thing that MapReduce is
> > >> good for, so writing a Hadoop job to do this seems like a
> > >> no-brainer.
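
A hedged sketch of what such a job could look like with the
org.apache.hadoop.mapred API of that era, assuming each input line holds
one user's ratings as space-separated item:rating pairs (all class names
are illustrative):

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    /** Map: for one user's ratings, emit (itemA:itemB, ratingA - ratingB) per pair.
        Reduce: average the differences for each item pair. */
    public class SlopeOneDiffJob {

      public static class DiffMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, DoubleWritable> {
        public void map(LongWritable key, Text line,
                        OutputCollector<Text, DoubleWritable> out, Reporter reporter)
            throws IOException {
          String trimmed = line.toString().trim();
          if (trimmed.length() == 0) return;
          String[] prefs = trimmed.split("\\s+");      // "item:rating" tokens
          for (int i = 0; i < prefs.length; i++) {
            String[] a = prefs[i].split(":");
            for (int j = i + 1; j < prefs.length; j++) {
              String[] b = prefs[j].split(":");
              String ia = a[0], ib = b[0];
              double diff = Double.parseDouble(a[1]) - Double.parseDouble(b[1]);
              if (ia.compareTo(ib) > 0) {              // keep pair keys in canonical order
                String t = ia; ia = ib; ib = t; diff = -diff;
              }
              out.collect(new Text(ia + ':' + ib), new DoubleWritable(diff));
            }
          }
        }
      }

      public static class AverageReducer extends MapReduceBase
          implements Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        public void reduce(Text pair, Iterator<DoubleWritable> values,
                           OutputCollector<Text, DoubleWritable> out, Reporter reporter)
            throws IOException {
          double sum = 0.0;
          int count = 0;
          while (values.hasNext()) { sum += values.next().get(); count++; }
          out.collect(pair, new DoubleWritable(sum / count));
        }
      }
    }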
> > >>
> > >> On Tue, Apr 29, 2008 at 11:14 AM, Goel, Ankur 
> > >> <An...@corp.aol.com>
> > >> wrote:
> > >>> Hi Folks,
> > >>>       What's the status of hadoopifying Taste on Mahout ?
> > >>>  What's been done and what is in progress/pending ?
> > >>>
> > >>>  I am looking at using a scalable version of Taste for my project.
> > >>>  So basically trying to figure out what's already done and where
> > >>>  I can pitch in.
> > >>>
> > >>>  Thanks
> > >>>  -Ankur
> > >>>
> > >>
> > >
> >
>
>
>
> --
> ted
>



--
ted

RE: LDA [was RE: Taste on Mahout]

Posted by Daniel Kluesing <da...@ilike-inc.com>.
Just a note that LDA and CRFs are two different algorithms. They both
fall under the general class of graphical models but otherwise solve
different problems.

LDA is trained on un-labeled, unordered data using a bag-of-words
assumption; CRFs are trained on labeled, sequential data with Markov
assumptions. (CRFs are sort of a generalization of HMMs.)

But CRFs are neat in their own right. Biology folks would get excited
about a good CRF implementation.


-----Original Message-----
From: Robin Anil [mailto:robin.anil@gmail.com] 
Sent: Saturday, June 07, 2008 4:36 AM
To: mahout-dev@lucene.apache.org
Subject: Re: LDA [was RE: Taste on Mahout]

Hi,
     There are some LDA/CRF implementations available online. They might
prove useful when writing the code:

* GibbsLDA++ <http://gibbslda.sourceforge.net/>: A C/C++ implementation
of Latent Dirichlet Allocation (LDA) using Gibbs sampling for parameter
estimation and inference. GibbsLDA++ is fast and is designed to analyze
hidden/latent topic structures of large-scale (text) data collections.

* CRFTagger <http://crftagger.sourceforge.net/>: A Java-based
Conditional Random Fields part-of-speech (POS) tagger for English. The
model was trained on sections 01..24 of the WSJ corpus, using section 00
as the development test set (accuracy of 97.00%). Tagging speed: 500
sentences / second.

* CRFChunker <http://crfchunker.sourceforge.net/>: A Java-based
Conditional Random Fields phrase chunker (phrase chunking tool) for
English. The model was trained on sections 01..24 of the WSJ corpus,
using section 00 as the development test set (F1-score of 95.77).
Chunking speed: 700 sentences / second.

* JTextPro <http://jtextpro.sourceforge.net/>: A Java-based text
processing tool that includes sentence boundary detection (using a
maximum entropy classifier), word tokenization (following the Penn
convention), part-of-speech tagging (using CRFTagger), and phrase
chunking (using CRFChunker).

* JWebPro <http://jwebpro.sourceforge.net/>: A Java-based tool that can
interact with Google search via the Google Web APIs and then process the
returned Web documents in a couple of ways. The outputs of JWebPro can
serve as inputs for natural language processing, information retrieval,
information extraction, Web data mining, online social network
extraction/analysis, and ontology development applications.

* JVnSegmenter <http://jvnsegmenter.sourceforge.net/>: A Java-based,
open-source Vietnamese word segmentation tool. The segmentation model in
this tool was trained on about 8,000 labeled sentences using FlexCRFs.
It would be useful for the Vietnamese NLP community.

* FlexCRFs <http://flexcrfs.sourceforge.net/>: Flexible Conditional
Random Fields (including PCRFs, a parallel version of FlexCRFs).

* CRF++: Yet Another CRF toolkit: http://flexcrfs.sourceforge.net/
Robin
On Thu, Jun 5, 2008 at 9:59 PM, Ted Dunning <te...@gmail.com>
wrote:

> The Buntine and Jakulin paper is also useful reading.  I would avoid
> fancy stuff like the Powell Rao-ization to start.
>
> http://citeseer.ist.psu.edu/750239.html
>
> The Gibbs sampling approach is, at its heart, very simple in that
> most of the math devolves into sampling discrete hidden variables from
> simple distributions and then counting the results as if they were
> observed.
>
> On Thu, Jun 5, 2008 at 5:49 AM, Goel, Ankur <An...@corp.aol.com>
> wrote:
>
> > It draws reference from the Java implementation at
> > http://www.arbylon.net/projects/LdaGibbsSampler.java
> > which is a single-class version of LDA using Gibbs sampling with
> > slightly better code documentation.
> > I am trying to understand the code while reading the paper you
> > suggested, "Distributed Inference for Latent Dirichlet Allocation".
> >
> > -----Original Message-----
> > From: Daniel Kluesing [mailto:daniel@ilike-inc.com]
> > Sent: Wednesday, June 04, 2008 8:31 PM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted may have a better one, but in my quick poking around at things 
> > http://gibbslda.sourceforge.net/ looks to be a good implementation 
> > of the Gibbs sampling approach.
> >
> > -----Original Message-----
> > From: Goel, Ankur [mailto:Ankur.Goel@corp.aol.com]
> > Sent: Wednesday, June 04, 2008 4:58 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: RE: LDA [was RE: Taste on Mahout]
> >
> > Ted, do you have a sequential LDA implementation that can be used
> > for reference?
> > If yes, can you please post it on Jira? Should we open a new Jira
> > or use MAHOUT-30 for this?
> >
> > -----Original Message-----
> > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > Sent: Tuesday, May 27, 2008 11:50 AM
> > To: mahout-dev@lucene.apache.org
> > Subject: Re: LDA [was RE: Taste on Mahout]
> >
> > Chris Bishop's book has a very clear exposition of the relationship 
> > between the variational techniques and EM.  Very good reading.
> >
> > On Mon, May 26, 2008 at 10:13 PM, Goel, Ankur 
> > <An...@corp.aol.com>
> > wrote:
> >
> > > Daniel/Ted,
> > >      Thanks for the interesting pointers to more information on 
> > > LDA and EM.
> > > I am going through the docs to visualize and understand how the
> > > LDA approach would work for my specific case.
> > >
> > > Once I have some idea, I can volunteer to work on the Map-Reduce
> > > side of things, as this is something that will benefit both my
> > > project and the community.
> > >
> > > Looking forward to sharing more ideas/information on this :-)
> > >
> > > Regards
> > > -Ankur
> > >
> > > -----Original Message-----
> > > From: Ted Dunning [mailto:ted.dunning@gmail.com]
> > > Sent: Tuesday, May 27, 2008 6:59 AM
> > > To: mahout-dev@lucene.apache.org
> > > Subject: Re: LDA [was RE: Taste on Mahout]
> > >
> > > Those are both new to me.  Both look interesting.  My own
> > > experience is that the simplicity of the Gibbs sampling makes it
> > > very much more attractive for implementation.  Also, since it is
> > > (nearly) trivially parallelizable, it is more likely we will get a
> > > useful implementation right off the bat.
> > >
> > > On Mon, May 26, 2008 at 5:49 PM, Daniel Kluesing 
> > > <da...@ilike-inc.com>
> > > wrote:
> > >
> > > > (Hijacking the thread to discuss ways to implement LDA)
> > > >
> > > > Had you seen
> > > > http://books.nips.cc/papers/files/nips20/NIPS2007_0672.pdf
> > > > ?
> > > >
> > > > Their hierarchical distributed LDA formulation uses Gibbs
> > > > sampling and fits into MapReduce.
> > > >
> > > > http://www.cs.berkeley.edu/~jawolfe/pubs/08-icml-em.pdf gives a
> > > > MapReduce formulation for the variational EM method.
> > > >
> > > > I'm still chewing on them, but my first impression is that the
> > > > EM approach would give better performance on bigger data sets.
> > > > Opposing views welcome.
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > ted
> >
>
>
>
> --
> ted
>



--
Robin Anil
4th Year Dual Degree Student
Department of Computer Science & Engineering IIT Kharagpur

------------------------------------------------------------------------
--------------------
techdigger.wordpress.com
A discursive take on the world around us

www.minekey.com
You Might Like This

www.ithink.com
Express Yourself
