Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2010/08/02 20:44:16 UTC
[jira] Commented: (MAHOUT-399) LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.
[ https://issues.apache.org/jira/browse/MAHOUT-399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894640#action_12894640 ]
Ted Dunning commented on MAHOUT-399:
------------------------------------
I think that this needs more study. I got an email from Mike, and it does seem reasonably likely that there is still a serious problem. The difficulty is that I respect both Mike's and David's opinions pretty highly, and they seem to draw incompatible conclusions. That still leaves me with the feeling that a problem is reasonably likely (> 10% chance at least).
{quote}
Hi Ted,
I have implemented a parallel version of LDA in C# that separates the processing, but not the data. It is based on collapsed Gibbs sampling. And it converges to the correct solution on the overlapping pyramids dataset.
The last e-mail from David Hall indicated to me that he did not think the result for the dataset was conclusive evidence there is a bug. I disagree. The statistics of the dataset are overwhelming. And when you look at the computed likelihood of the corpus it typically reaches its maximum at 5 topics.
It took me a while to get Hadoop up and running on EC2 and then to get the Mahout examples running. After David's e-mail indicating he did not think the result was conclusive, I decided to implement something for the environment I am working in.
I did not see much in the way of documentation for the Mahout implementation, but my guess at the algorithm was that it was using a variational method. Since I have not implemented that approach, I do not have an idea where the bug is yet.
Blei's C implementation does converge as well. On rare occasions it does not converge, but rerunning it will almost always yield convergence.
I have run David Hall's implementation for different numbers of topics and repeatedly for each number of topics. It has never converged.
I did send a document along describing the dataset and providing a sample so that someone else could corroborate the result. I may have made a procedural error in running LDA even though I think I ran everything correctly.
I would be interested in looking at the variational approach and then trying to debug the current algorithm, but I do not have time to do that at the moment. Another option would be to convince David Hall to take a second look.
I hope that helps a little. I would be happy to talk to anyone in more detail.
Thanks,
Mike.
{quote}
> LDA on Mahout 0.3 does not converge to correct solution for overlapping pyramids toy problem.
> ---------------------------------------------------------------------------------------------
>
> Key: MAHOUT-399
> URL: https://issues.apache.org/jira/browse/MAHOUT-399
> Project: Mahout
> Issue Type: Bug
> Components: Classification
> Affects Versions: 0.3
> Environment: Mac OS X 10.6.2, Hadoop 0.20.2, Mahout 0.3.
> Reporter: Michael Lazarus
> Priority: Critical
> Attachments: olt.tar, Overlapping Pyramids Toy Dataset.pdf
>
>
> Hello,
> Apologies if I have not labeled this correctly.
> I have run a toy problem on Mahout 0.3 (locally) for LDA, the same one I used to test the C version of LDA that Blei posts on his site. It has an exact solution that LDA should converge to. Please see the attached PDF, which describes the intended output.
> Is LDA working? The following output indicates some sort of collapsing behavior to me.
> T0 T1 T2 T3 T4
> x w x u x
> u u g j n
> l r i m l
> j q h h p
> v p e i q
> e t f g v
> d s d f o
> b c b n k
> y f c l m
> w v u v u
> c d p y t
> k o l r r
> i b j k j
> f e k e f
> g x y s y
> t y w b w
> h i s p s
> o l v x d
> q j t d i
> n k o t b
> The intended output is (again, please see attached):
> D I N S X
> d i n s x
> c h m t y
> e j o r w
> b k l u v
> f g p q a
> a f k p b
> g l q v u
> h m j w t
> y u r o c
> n s d d i
> s e x f f
> r q i i n
> m v w c o
> o w u a h
> q n s h g
> p t c x d
> t x f e l
> x d e j s
> w y g b j
> i r y n r
> u o h y m
> k b t l e
> v c a m k
> j a b g p
> l p v k q
> What tests do you run to make sure the output is correct?
> Thank you,
> Mike.
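One way to make the comparison above less dependent on eyeballing the two tables is to score how well the learned topics line up with the intended ones. The following is only a sketch in Python, not something Mahout ships or the test the reporter ran: the helper names `jaccard` and `best_alignment` are made up for illustration, and the word lists are the top five rows of each column copied from the two tables above. It tries every one-to-one pairing of learned and intended topics and reports the best mean Jaccard overlap of their top words.

```python
from itertools import permutations

def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def best_alignment(learned, intended):
    """Return (mean Jaccard, pairing) for the best one-to-one topic matching.

    pairing[i] is the index of the intended topic assigned to learned[i].
    Brute force over all pairings, which is fine for five topics (120 cases).
    """
    n = len(intended)
    best_score, best_perm = -1.0, None
    for perm in permutations(range(n)):
        score = sum(jaccard(learned[i], intended[perm[i]]) for i in range(n)) / n
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

# Top five words per topic, read column by column from the tables above.
intended = ["dcebf", "ihjkg", "nmolp", "sturq", "xywva"]  # columns D I N S X
mahout = ["xuljv", "wurqp", "xgihe", "ujmhi", "xnlpq"]    # columns T0..T4

perfect_score, _ = best_alignment(intended, intended)
mahout_score, mahout_pairing = best_alignment(mahout, intended)

print("intended vs. itself:", perfect_score)
print("Mahout vs. intended:", round(mahout_score, 3), mahout_pairing)
```

A score of 1.0 means every learned topic has exactly the intended top words. A score well below that, as for the Mahout columns here, is consistent with the collapsing behavior described above, though a low score by itself does not say where the problem lies.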