Posted to dev@mahout.apache.org by "Jake Mannix (JIRA)" <ji...@apache.org> on 2011/05/23 16:37:47 UTC

[jira] [Resolved] (MAHOUT-683) LDA Vectorization

     [ https://issues.apache.org/jira/browse/MAHOUT-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix resolved MAHOUT-683.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5

You do indeed. The LDA output directory should contain a directory holding all the state-<num> intermediate states, as well as a docTopics sequence file directory, which contains the projection of each document onto the topics.

> LDA Vectorization
> -----------------
>
>                 Key: MAHOUT-683
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-683
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA., Vectorization
>             Fix For: 0.5
>
>         Attachments: MAHOUT-683.patch
>
>
> Currently, the result of the LDA clustering algorithm is a state that describes the probability that each word in a corpus of documents belongs to each topic. This probability is calculated over the whole corpus.
> It is also of interest, however, how many words of a given document come, on average, from a given topic. This information is provided by the gamma vector in the LDA inference process. This vector can be used as a representation of the document for further clustering (using algorithms such as KMeans, Dirichlet, etc.). In this way, the dimensionality of a document is reduced to the number of topics specified for the LDA clustering algorithm.
> With the proposed implementation, a corpus of documents described as vectors, together with the last state of the LDA inference process, yields a set of reduced-dimension vectors (one vector per document) that represents the set of documents.
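
The idea of using the gamma vector as a reduced document representation can be sketched roughly as follows. This is an illustration only, not the code from MAHOUT-683.patch and not Mahout's API: the function name, the document ids, and the gamma values are all hypothetical, and whether Mahout normalizes gamma before writing it out is an assumption made here for readability.

```python
def topic_proportions(gamma):
    """Normalize one document's gamma vector so its entries sum to 1.

    gamma[k] is taken to be the (approximate) expected number of words
    in the document that were drawn from topic k.
    """
    total = sum(gamma)
    if total <= 0:
        raise ValueError("gamma must have a positive sum")
    return [g / total for g in gamma]


# Hypothetical gamma vectors for three documents and K = 3 topics.
# In practice these would come out of LDA inference, one per document.
doc_gammas = {
    "doc-a": [8.0, 1.0, 1.0],
    "doc-b": [1.0, 6.0, 3.0],
    "doc-c": [2.0, 2.0, 4.0],
}

# Each document is now a K-dimensional vector, regardless of how large
# the original vocabulary was.
reduced = {doc_id: topic_proportions(g) for doc_id, g in doc_gammas.items()}
```

Each entry of `reduced` has as many dimensions as there are topics, so it could in principle be fed to a clustering algorithm such as KMeans in place of the original term vector.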

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira