Posted to user@mahout.apache.org by Charles Earl <ch...@me.com> on 2011/12/01 05:28:49 UTC

Re: LDATopic

Jake,
Thanks for the pending update.
Slightly off topic: if I understand your notes on MAHOUT-897, Gibbs sampling would only be feasible in MapReduce implementations that support efficient iteration -- Spark, perhaps YARN -- but not in Mahout as currently conceived. In the case of Spark, the RDD is the shared memory that enables faster synchronization across samplers. The need for synchronization across local samplers may mean that Gibbs sampling is better suited to OpenMP.
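The synchronization pattern I have in mind is roughly what approximate distributed LDA does: each sampler sweeps its shard against a snapshot of the global topic counts, and the local deltas are merged once per iteration before the next sweep starts. A single-machine sketch of just that structure -- the class and method names are mine, and the per-shard sweep is a trivial placeholder rather than a real Gibbs sampler:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the per-iteration synchronization that distributed Gibbs sampling
// needs: each sampler reads a snapshot of the global topic counts, sweeps its
// own shard, and the local deltas are folded back in before the next iteration.
public class GibbsSyncSketch {
  static final int NUM_TOPICS = 20;

  public static void main(String[] args) throws Exception {
    int numWorkers = 4;
    int numIterations = 50;
    long[] globalTopicCounts = new long[NUM_TOPICS];
    ExecutorService pool = Executors.newFixedThreadPool(numWorkers);

    for (int iter = 0; iter < numIterations; iter++) {
      // Read-only snapshot shared by all samplers during this iteration.
      final long[] snapshot = globalTopicCounts.clone();
      List<Future<long[]>> deltas = new ArrayList<Future<long[]>>();
      for (int w = 0; w < numWorkers; w++) {
        deltas.add(pool.submit(new Callable<long[]>() {
          public long[] call() {
            return sampleShard(snapshot);
          }
        }));
      }
      // Barrier: merge every worker's delta before the next iteration starts.
      for (Future<long[]> f : deltas) {
        long[] d = f.get();
        for (int k = 0; k < NUM_TOPICS; k++) {
          globalTopicCounts[k] += d[k];
        }
      }
    }
    pool.shutdown();
    System.out.println(Arrays.toString(globalTopicCounts));
  }

  // Placeholder for a real per-shard Gibbs sweep, which would resample the
  // topic assignment of every token against the snapshot counts.
  static long[] sampleShard(long[] snapshot) {
    long[] delta = new long[NUM_TOPICS];
    Random rng = new Random();
    for (int token = 0; token < 1000; token++) {
      delta[rng.nextInt(NUM_TOPICS)]++;
    }
    return delta;
  }
}

On Spark, the RDD (plus a broadcast of the counts) is what makes that snapshot-and-merge step cheap across iterations; in plain MapReduce every one of those barriers costs a job launch and an HDFS round trip, which is why iteration cost dominates.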
The approach in MAHOUT-897 is understandably similar to http://arxiv.org/pdf/1107.3765 (Using Variational Inference and MapReduce to Scale Topic Modeling).
Do you have any recommendations for topic updates that might work well (close to real time) in practice?
For example, Yao's http://www.cs.umass.edu/~lmyao/papers/fast-topic-model10.pdf suggests simple heuristics for identifying novel topics and memory-efficient streaming updates based on SparseLDA. I would expect that something built on SparseLDA would be efficient for online updates.
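To make the SparseLDA trick concrete: the per-token sampling mass factors into a smoothing-only bucket s, a document-topic bucket r, and a topic-word bucket q, so most draws only ever touch nonzero counts. A rough sketch of that decomposition follows -- the flat count arrays and the fallback return are simplifications of the paper's actual data structures, and the current token's own counts are assumed to have been decremented already:

import java.util.Random;

// Rough sketch of the SparseLDA decomposition: the unnormalized sampling mass
// for one token splits into a smoothing-only bucket s, a document-topic bucket
// r, and a topic-word bucket q, so the sampler mostly touches nonzero counts.
public class SparseLdaSketch {

  // Sample a topic for one occurrence of word w in one document.
  static int sampleTopic(int w, int[] docTopicCounts, int[][] wordTopicCounts,
                         int[] topicTotals, double alpha, double beta,
                         int numTopics, int vocabSize, Random rng) {
    double betaSum = beta * vocabSize;

    double s = 0.0, r = 0.0, q = 0.0;
    double[] qCoeff = new double[numTopics];
    for (int k = 0; k < numTopics; k++) {
      double denom = topicTotals[k] + betaSum;
      s += alpha * beta / denom;                     // smoothing-only mass
      if (docTopicCounts[k] > 0) {
        r += docTopicCounts[k] * beta / denom;       // document-topic mass
      }
      qCoeff[k] = (alpha + docTopicCounts[k]) / denom;
      if (wordTopicCounts[w][k] > 0) {
        q += qCoeff[k] * wordTopicCounts[w][k];      // topic-word mass
      }
    }

    double u = rng.nextDouble() * (s + r + q);
    if (u < q) {
      // Most draws land in the q bucket, which only visits topics where
      // the word actually has a nonzero count.
      for (int k = 0; k < numTopics; k++) {
        if (wordTopicCounts[w][k] > 0) {
          u -= qCoeff[k] * wordTopicCounts[w][k];
          if (u <= 0) {
            return k;
          }
        }
      }
    } else if (u < q + r) {
      u -= q;
      for (int k = 0; k < numTopics; k++) {
        if (docTopicCounts[k] > 0) {
          u -= docTopicCounts[k] * beta / (topicTotals[k] + betaSum);
          if (u <= 0) {
            return k;
          }
        }
      }
    } else {
      u -= q + r;
      for (int k = 0; k < numTopics; k++) {
        u -= alpha * beta / (topicTotals[k] + betaSum);
        if (u <= 0) {
          return k;
        }
      }
    }
    return numTopics - 1;                            // numerical fallback
  }
}

Most of the mass usually sits in q, which only iterates over the topics in which the word actually occurs; the paper additionally keeps the counts in sorted sparse encodings so the bookkeeping after each draw stays cheap.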
Charles


On Nov 30, 2011, at 4:14 PM, Jake Mannix wrote:

> On Wed, Nov 30, 2011 at 1:03 PM, Isabel Drost <is...@apache.org> wrote:
> 
>> On 28.11.2011 bish maten wrote:
>>> mahout ldatopics -i mahout-work/abc/abc-lda/state-20  -d
>>> mahout-work/abc/abc-out-seqdir-sparse-lda/dictionary.file-0  -dt
>>> sequencefile  (there were no errors reported and the command worked fine
>>> with the following output). Does the output appear ok?
>> 
>> Hmm - this only prints the resulting LDA topics - which command did you
>> use to generate them?
>> 
>> Please also note that Jake is currently working on improving our LDA
>> support. If you are interested in that algorithm it might be interesting
>> for you to look into his patch in
>> https://issues.apache.org/jira/browse/MAHOUT-897
> 
> 
> Yeah, I'm also working on moving away from LDATopic altogether, instead
> using VectorDumper + dictionary file and grabbing the top N weighted
> elements in the vector representing the topic.  We already do this
> internally at Twitter; I just have to get that particular patch formatted
> properly and cleaned up once MAHOUT-897 gets committed (which will
> hopefully be this week).
> 
>  -jake
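
(For reference, a bare-bones sketch of what the VectorDumper + dictionary route amounts to once a topic's term weights are in hand: keep the top N weighted entries and map their indices through the dictionary. The array and map inputs below stand in for whatever the patch will actually read from the sequence files; this is not the patch itself.)

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Sketch of "top N weighted terms per topic": given one topic's term weights
// and an index->term dictionary, sort the nonzero entries by weight and keep
// the heaviest N, formatted as "term (weight)" strings.
public class TopTermsSketch {

  static List<String> topTerms(double[] topicWeights, Map<Integer, String> dictionary, int n) {
    // Collect the nonzero (index, weight) pairs.
    List<SimpleEntry<Integer, Double>> entries = new ArrayList<SimpleEntry<Integer, Double>>();
    for (int i = 0; i < topicWeights.length; i++) {
      if (topicWeights[i] != 0.0) {
        entries.add(new SimpleEntry<Integer, Double>(i, topicWeights[i]));
      }
    }
    // Sort by weight, descending.
    Collections.sort(entries, new Comparator<SimpleEntry<Integer, Double>>() {
      public int compare(SimpleEntry<Integer, Double> a, SimpleEntry<Integer, Double> b) {
        return Double.compare(b.getValue(), a.getValue());
      }
    });
    // Resolve the top N indices through the dictionary.
    List<String> top = new ArrayList<String>();
    for (int i = 0; i < Math.min(n, entries.size()); i++) {
      SimpleEntry<Integer, Double> e = entries.get(i);
      top.add(dictionary.get(e.getKey()) + " (" + e.getValue() + ")");
    }
    return top;
  }
}

Calling topTerms(weights, dictionary, 10) on each topic's vector gives the same kind of human-readable per-topic summary that the ldatopics output above shows.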