Posted to dev@mahout.apache.org by "Jake Mannix (JIRA)" <ji...@apache.org> on 2011/05/01 07:53:03 UTC

[jira] [Commented] (MAHOUT-684) Topics regularization for LDA

    [ https://issues.apache.org/jira/browse/MAHOUT-684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13027431#comment-13027431 ] 

Jake Mannix commented on MAHOUT-684:
------------------------------------

Hi Vasil,

  I've been trying to combine this patch with the patch I have on MAHOUT-682 (similar to what you've got in MAHOUT-683). Apart from getting tripped up on all the static methods (which are not so great for unit testing and break encapsulation pretty badly), LDADriver#writeNewAlpha() seems to do very strange things. It first loads the entire LDAState with createState(). Then it iterates over the entire HDFS-serialized intermediate state (which should be the same data that createState() just iterated over, right?), finds the digammaGamma vector, and does some cool estimation of the new alpha. Finally it creates a SequenceFile.Writer to write the entire state back out again (but now with the newly estimated alpha).  The IO behavior of this seems pretty atrocious.

  I'd really like to get this new alpha-estimation stuff in (it looks great), but we've got to clean up the way we're reading/writing state to HDFS.  At the bare minimum, we should read the intermediate state once after every iteration, and write it back out (with the new alpha) once.  Better than that would be to use multiple Paths and multiple outputs, although this is yet again something the Hadoop 0.20 API does not support: you have to go back to the deprecated o.a.h.mapred codebase to do it, just like for map-side joins, ARG!
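
  To make the "read once, write once" idea concrete, here is a rough sketch in plain Java with no Hadoop dependency. Everything in it is hypothetical: the Entry/Kind record layout, the AlphaEstimator interface, and the rewrite() method are invented for illustration and do not mirror the real LDADriver, LDAState, or the on-disk format. It also assumes the state fits in memory, which createState() already assumes today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one read pass and one write pass per iteration. */
public class SinglePassAlphaUpdate {

  /** Invented record kinds; the real serialized state is laid out differently. */
  public enum Kind { TOPIC_WORD, ALPHA, DIGAMMA_GAMMA }

  /** One serialized record of the intermediate state (illustrative only). */
  public static final class Entry {
    public final Kind kind;
    public final int index;
    public final double value;

    public Entry(Kind kind, int index, double value) {
      this.kind = kind;
      this.index = index;
      this.value = value;
    }
  }

  /** Stand-in for the alpha estimation from the patch; not its real signature. */
  public interface AlphaEstimator {
    double[] estimate(double[] oldAlpha, double[] digammaGamma);
  }

  /** Reads every record exactly once, then emits the state with the new alpha. */
  public static List<Entry> rewrite(Iterable<Entry> input, int numTopics,
                                    AlphaEstimator estimator) {
    double[] oldAlpha = new double[numTopics];
    double[] digammaGamma = new double[numTopics];
    List<Entry> output = new ArrayList<Entry>();

    // Single read pass: collect the estimator inputs while buffering
    // every non-alpha record for the write.
    for (Entry e : input) {
      switch (e.kind) {
        case ALPHA:
          oldAlpha[e.index] = e.value;   // replaced below, so not copied through
          break;
        case DIGAMMA_GAMMA:
          digammaGamma[e.index] = e.value;
          output.add(e);
          break;
        default:
          output.add(e);
      }
    }

    // Single write: the buffered records followed by the newly estimated alpha.
    double[] newAlpha = estimator.estimate(oldAlpha, digammaGamma);
    for (int k = 0; k < numTopics; k++) {
      output.add(new Entry(Kind.ALPHA, k, newAlpha[k]));
    }
    return output;
  }

  public static void main(String[] args) {
    List<Entry> state = Arrays.asList(
        new Entry(Kind.TOPIC_WORD, 0, -1.5),
        new Entry(Kind.ALPHA, 0, 0.5),
        new Entry(Kind.ALPHA, 1, 0.5),
        new Entry(Kind.DIGAMMA_GAMMA, 0, -2.0),
        new Entry(Kind.DIGAMMA_GAMMA, 1, -3.0));

    // Placeholder estimator that just doubles alpha; NOT the Newton step.
    AlphaEstimator placeholder = (oldAlpha, dg) -> {
      double[] out = oldAlpha.clone();
      for (int k = 0; k < out.length; k++) {
        out[k] *= 2.0;
      }
      return out;
    };

    List<Entry> updated = rewrite(state, 2, placeholder);
    System.out.println(updated.size() + " records written");
  }
}
```

  In a real patch the Iterable would be backed by a reader over the iteration's output directory and the returned list would go to a single writer, so each iteration touches HDFS once in each direction.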

  Do you think you could help me incorporate this algorithm improvement into a patch once I've got MAHOUT-682 merged into trunk?

> Topics regularization for LDA
> -----------------------------
>
>                 Key: MAHOUT-684
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-684
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Vasil Vasilev
>            Priority: Minor
>              Labels: LDA.
>         Attachments: MAHOUT-684.patch
>
>
> Implementation of the alpha parameter estimation described in the paper by Blei, Ng and Jordan (http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf).
> Remark: there is a mistake in the last formula in A.4.2 (the signs are wrong). The correct version is described here: http://www.cs.cmu.edu/~jch1/research/dirichlet/dirichlet.pdf (page 6).
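
For reference, the Newton-Raphson update for alpha that the issue description points to can be sketched as below, with the Hessian signs as I understand the corrected version. The notation is assumed from the paper rather than taken from the patch: M documents, k topics, \gamma_d the variational Dirichlet parameters of document d, \Psi the digamma function and \Psi' the trigamma function. The second sum in g_i is presumably what the patch accumulates as digammaGamma, but that is a guess from the name.

```latex
\begin{align*}
g_i &= \frac{\partial L}{\partial \alpha_i}
     = M\Big(\Psi\Big(\sum_{j=1}^{k}\alpha_j\Big) - \Psi(\alpha_i)\Big)
       + \sum_{d=1}^{M}\Big(\Psi(\gamma_{di}) - \Psi\Big(\sum_{j=1}^{k}\gamma_{dj}\Big)\Big) \\
H_{ij} &= \frac{\partial^2 L}{\partial \alpha_i\,\partial \alpha_j}
     = M\,\Psi'\Big(\sum_{l=1}^{k}\alpha_l\Big) - \delta(i,j)\,M\,\Psi'(\alpha_i) \\
H &= \operatorname{diag}(h) + z\,\mathbf{1}\mathbf{1}^{\top},
     \qquad h_i = -M\,\Psi'(\alpha_i),
     \qquad z = M\,\Psi'\Big(\sum_{l=1}^{k}\alpha_l\Big) \\
(H^{-1}g)_i &= \frac{g_i - c}{h_i},
     \qquad c = \frac{\sum_{j=1}^{k} g_j / h_j}{z^{-1} + \sum_{j=1}^{k} h_j^{-1}} \\
\alpha^{\text{new}} &= \alpha - H^{-1}g
\end{align*}
```

Because H is a diagonal matrix plus a constant matrix, the step costs O(k) per iteration instead of the O(k^3) of a general matrix inversion.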

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira