Posted to dev@mahout.apache.org by "Christoph Boden (JIRA)" <ji...@apache.org> on 2011/09/21 19:57:16 UTC

[jira] [Created] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

LDA Inference Corrections, Alpha (Dirichlet) Estimation
-------------------------------------------------------

                 Key: MAHOUT-815
                 URL: https://issues.apache.org/jira/browse/MAHOUT-815
             Project: Mahout
          Issue Type: Improvement
          Components: Clustering
    Affects Versions: 0.6
            Reporter: Christoph Boden


Hi, I am a PhD student at TU Berlin DIMA. I am currently working on Mahout's LDA implementation together with Sebastian Schelter. We identified a couple of points that can be fixed or improved in the current version.

We propose to fix the inference in the expectation step of EM in accordance with [1], to implement maximum likelihood estimation of the Dirichlet distribution parameter (alpha) as presented in [1], and to do some refactoring.

[1] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). Lafferty, John. ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4-5): pp. 993-1022. doi:10.1162/jmlr.2003.3.4-5.993
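
To make the alpha estimation step concrete, here is a minimal Python sketch of a Newton-Raphson maximum likelihood update. It is an illustration under stated assumptions, not the proposed Mahout patch: it assumes a single symmetric alpha shared by all topics (the paper treats a full alpha vector), it iterates on log(alpha) to keep alpha positive, and the names digamma, trigamma and estimate_symmetric_alpha are hypothetical. In variational EM the sufficient statistic would be the sum over documents d and topics k of digamma(gamma_dk) - digamma(sum_j gamma_dj).

```python
import math

def digamma(x):
    """Psi(x) for x > 0: shift x upward by recurrence, then asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def trigamma(x):
    """Derivative of digamma for x > 0, same recurrence-plus-series approach."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return result + 1.0 / x + f / 2 + (f / x) * (1.0 / 6 - f * (1.0 / 30 - f / 42))

def estimate_symmetric_alpha(suff_stats, num_docs, num_topics,
                             init_alpha=1.0, max_iter=100, tol=1e-10):
    """Maximize L(a) = D*(lgamma(K*a) - K*lgamma(a)) + (a - 1)*suff_stats
    with Newton steps taken on log(a). Assumes a symmetric Dirichlet."""
    d, k = num_docs, num_topics
    log_a = math.log(init_alpha)
    for _ in range(max_iter):
        a = math.exp(log_a)
        grad = d * k * (digamma(k * a) - digamma(a)) + suff_stats
        hess = d * (k * k * trigamma(k * a) - k * trigamma(a))
        # Newton step for g(u) = L(exp(u)): g'/g'' = grad / (hess * a + grad)
        step = grad / (hess * a + grad)
        log_a -= step
        if abs(step) < tol:
            break
    return math.exp(log_a)
```

A quick sanity check is to build the sufficient statistic from a known alpha, for example suff_stats = D * K * (digamma(a0) - digamma(K * a0)), and confirm that the estimate comes back close to a0.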

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109845#comment-13109845 ] 

Jake Mannix commented on MAHOUT-815:
------------------------------------

Yes, Yahoo primarily does Gibbs sampling (or "collapsed Gibbs sampling", to be more precise), which is just a stochastic version of exactly the same update equations used in collapsed variational Bayes.
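
As a rough illustration of that relationship, the sketch below uses the commonly stated per-token update for LDA with symmetric priors. It is not taken from Mahout or from the cvb0 branch; the function names and the count layout are assumptions made for this example. Both variants start from the same topic weights: the zero-order collapsed variational update keeps the whole normalized distribution, while the collapsed Gibbs update draws one topic from it.

```python
import random

def topic_weights(n_dk, n_kw, n_k, alpha, beta, vocab_size):
    """Unnormalized topic weights for one token of word w in document d.

    n_dk[k]: count of topic k in the document, n_kw[k]: count of word w in
    topic k, n_k[k]: total count of topic k. The token's own contribution is
    assumed to have been subtracted from all three already.
    """
    return [(n_dk[k] + alpha) * (n_kw[k] + beta) / (n_k[k] + vocab_size * beta)
            for k in range(len(n_dk))]

def cvb0_update(weights):
    """Deterministic update: keep the full normalized distribution."""
    total = sum(weights)
    return [w / total for w in weights]

def gibbs_update(weights, rng):
    """Stochastic update: sample a single topic index from the same weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

if __name__ == "__main__":
    w = topic_weights([3, 1], [1, 2], [10, 10], alpha=0.1, beta=0.01, vocab_size=100)
    print(cvb0_update(w))                    # soft assignment over topics
    print(gibbs_update(w, random.Random(0))) # one sampled topic index
```

Averaged over many draws, the sampled topic frequencies approach the soft assignment, which is the sense in which the sampler is a stochastic version of the same equations.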


        

[jira] [Resolved] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

Posted by "Sebastian Schelter (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter resolved MAHOUT-815.
---------------------------------------

    Resolution: Won't Fix

Closing this, as the optimization will not be necessary once Jake's updated LDA implementation is committed.
                

        

[jira] [Commented] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109724#comment-13109724 ] 

Jake Mannix commented on MAHOUT-815:
------------------------------------

I would suggest holding off on this a little, or else looking at my complete reworking of Mahout's LDA implementation on GitHub: https://github.com/jakemannix/Mahout (see the "cvb0" branch). I've moved from a straightforward Variational Bayes (as in the original paper) to a "Collapsed Variational Bayes" with some approximations, which speeds it up by a factor of 10-15 and no longer requires the entire model to live in memory.

Refactoring on the current codebase will get squashed by these changes, I'm afraid.  I'll really try to clean that code up and put up a patch for review this week or next.


        

[jira] [Assigned] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

Posted by "Sebastian Schelter (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Schelter reassigned MAHOUT-815:
-----------------------------------------

    Assignee: Sebastian Schelter


        

[jira] [Commented] (MAHOUT-815) LDA Inference Corrections, Alpha (Dirichlet) Estimation

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109765#comment-13109765 ] 

Ted Dunning commented on MAHOUT-815:
------------------------------------

Alex Smola has an excellent blog that might provide some interesting insights into additional improvements:

http://blog.smola.org/post/6359713161/speeding-up-latent-dirichlet-allocation

The article that you mention here seems a bit old and the newer references that Alex gives might be better to use.
