Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/08/02 20:44:16 UTC

[jira] Commented: (MAHOUT-399) LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.

    [ https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894640#action_12894640 ] 

Ted Dunning commented on MAHOUT-399:
------------------------------------

I think this needs more study.  I got an email from Mike (quoted below), and it does seem reasonably likely that there is still a serious problem.  The difficulty is that I respect both Mike's and David's opinions highly, and they draw incompatible conclusions.  That leaves me with the feeling that a real bug is reasonably likely (at least a 10% chance).

{quote}
Hi Ted,

I have implemented a parallel version of LDA in C# that separates the processing, but not the data.  It is based on collapsed Gibbs sampling.  And it converges to the correct solution on the overlapping pyramids dataset.

The last e-mail from David Hall indicated to me that he did not think the result for the dataset was conclusive evidence there is a bug.  I disagree.  The statistics of the dataset are overwhelming.  And when you look at the computed likelihood of the corpus it typically reaches its maximum at 5 topics.  

It took me a while to get Hadoop up and running on EC2 and then to get the Mahout examples running.  After David's e-mail indicating he did not think the result was conclusive, I decided to implement something for the environment I am working in.

I did not see much in the way of documentation for the Mahout implementation, but my guess at the algorithm was that it was using a variational method.  Since I have not implemented that approach, I do not have an idea where the bug is yet.

Blei's C implementation does converge as well.  On rare occasions it does not, but rerunning it almost always yields convergence.

I have run David Hall's implementation for different numbers of topics and repeatedly for each number of topics.  It has never converged.

I did send a document along describing the dataset and providing a sample so that someone else could corroborate the result.  I may have made a procedural error in running LDA even though I think I ran everything correctly.  

I would be interested in looking at the variational approach and then trying to debug the current algorithm, but I do not have time to do that at the moment.  Another option would be to convince David Hall to take a second look.

I hope that helps a little.  I would be happy to talk to anyone in more detail.

Thanks,
Mike.
{quote}
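
For anyone who wants to corroborate the result without the attachments, here is a rough Python sketch of how a pyramid-style corpus could be generated.  It is not Mike's generator.  The 25-letter alphabet and the five peaks d, i, n, s, x are read off the intended-output table in the issue; the triangular shape, the half-width of 5, the wrap-around at the ends of the alphabet, and the names pyramid_topic and sample_document are assumptions made for illustration.  The attached PDF remains the authority on the real dataset.

```python
import random
import string

ALPHABET = string.ascii_lowercase[:25]  # 'a'..'y', the letters seen in the issue's tables
CENTERS = "dinsx"                       # peak word of each intended topic
HALF_WIDTH = 5                          # assumed: weight reaches zero 5 letters from the peak


def pyramid_topic(center, half_width=HALF_WIDTH):
    """Triangular word distribution peaked at `center`, wrapping around the alphabet."""
    n = len(ALPHABET)
    c = ALPHABET.index(center)
    weights = []
    for i in range(n):
        d = min((i - c) % n, (c - i) % n)  # cyclic distance to the peak
        weights.append(max(half_width - d, 0))
    total = sum(weights)
    return [w / total for w in weights]


def sample_document(topics, length, alpha, rng):
    """Draw one document with the standard LDA generative process."""
    # Symmetric Dirichlet draw via normalized Gamma variates.
    gammas = [rng.gammavariate(alpha, 1.0) for _ in topics]
    norm = sum(gammas)
    theta = [g / norm for g in gammas]
    words = []
    for _ in range(length):
        z = rng.choices(range(len(topics)), weights=theta)[0]       # pick a topic
        words.append(rng.choices(ALPHABET, weights=topics[z])[0])   # pick a word from it
    return words
```

With these assumptions each topic puts weight on nine letters, and neighbouring topics share four of them, which is what makes the pyramids overlap.  A corpus would be something like [sample_document(topics, 100, 0.5, random.Random(0)) for each document], with the document length and alpha chosen by the experimenter.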

> LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-399
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-399
>             Project: Mahout
>          Issue Type: Bug
>          Components: Classification
>    Affects Versions: 0.3
>         Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
>            Reporter: Michael Lazarus
>            Priority: Critical
>         Attachments: olt.tar, Overlapping Pyramids Toy Dataset.pdf
>
>
> Hello,
> Apologies if I have not labeled this correctly.
> I have run a toy problem on Mahout 0.3 (locally) for LDA, the same one I used to test the C version of LDA that Blei posts on his site. It has an exact solution that LDA should converge to.  Please see the attached PDF, which describes the intended output.
> Is LDA working?  The following output indicates some sort of collapsing behavior to me.
> T0 	T1 	T2 	T3 	T4
> x 	w 	x 	u 	x
> u 	u 	g 	j 	n
> l 	r 	i 	m 	l
> j 	q 	h 	h 	p
> v 	p 	e 	i 	q
> e 	t 	f 	g 	v
> d 	s 	d 	f 	o
> b 	c 	b 	n 	k
> y 	f 	c 	l 	m
> w 	v 	u 	v 	u
> c 	d 	p 	y 	t
> k 	o 	l 	r 	r
> i 	b 	j 	k 	j
> f 	e 	k 	e 	f
> g 	x 	y 	s 	y
> t 	y 	w 	b 	w
> h 	i 	s 	p 	s
> o 	l 	v 	x 	d
> q 	j 	t 	d 	i
> n 	k 	o 	t 	b
> The intended output is (again, please see attached):
> D 	I 	N 	S 	X
> d 	i 	n 	s 	x
> c 	h 	m 	t 	y
> e 	j 	o 	r 	w
> b 	k 	l 	u 	v
> f 	g 	p 	q 	a
> a 	f 	k 	p 	b
> g 	l 	q 	v 	u
> h 	m 	j 	w 	t
> y 	u 	r 	o 	c
> n 	s 	d 	d 	i
> s 	e 	x 	f 	f
> r 	q 	i 	i 	n
> m 	v 	w 	c 	o
> o 	w 	u 	a 	h
> q 	n 	s 	h 	g
> p 	t 	c 	x 	d
> t 	x 	f 	e 	l
> x 	d 	e 	j 	s
> w 	y 	g 	b 	j
> i 	r 	y 	n 	r
> u 	o 	h 	y 	m
> k 	b 	t 	l 	e
> v 	c 	a 	m 	k
> j 	a 	b 	g 	p
> l 	p 	v 	k 	q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.
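
On the question of what test would show that the output is correct: since topic order is arbitrary, one simple check is to compare the top-ranked words of each intended topic against every learned topic and keep the best match.  A minimal sketch in Python, assuming top-k overlap is an acceptable measure.  The function name best_overlaps and the choice of k = 5 are illustrative, and the word lists are the first five rows of the two tables in the issue.

```python
def best_overlaps(learned, intended, k=5):
    """For each intended topic, return the largest number of top-k words
    it shares with any single learned topic (k would mean a full match)."""
    result = {}
    for name, ref_words in intended.items():
        ref = set(ref_words[:k])
        result[name] = max(len(ref & set(words[:k])) for words in learned.values())
    return result


# First five rows of the Mahout output table (columns T0..T4).
learned = {
    "T0": "xuljv",
    "T1": "wurqp",
    "T2": "xgihe",
    "T3": "ujmhi",
    "T4": "xnlpq",
}

# First five rows of the intended output table (columns D, I, N, S, X).
intended = {
    "D": "dcebf",
    "I": "ihjkg",
    "N": "nmolp",
    "S": "struq",
    "X": "xywva",
}

scores = best_overlaps(learned, intended)
for topic, score in scores.items():
    print(topic, score)
```

A score of k for every intended topic would mean each pyramid was recovered by some learned topic; lower scores point to topics that mix several pyramids.  Repeating the check over several runs and topic counts would give a rough regression test.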
