Posted to dev@mahout.apache.org by Robin Anil <ro...@gmail.com> on 2010/05/24 17:12:46 UTC

Problems with LDA

The email bounced because of the attachment size. I will upload it to JIRA.

+Mike +David

See the forwarded email below. Does anyone have any idea what's going on?

@David: Can you tell Mike how the LDA implementation differs from the
original version?

Robin


---------- Forwarded message ----------
From: Mike Lazarus <mi...@yahoo.com>
Date: Mon, May 24, 2010 at 6:47 PM
Subject: Re: Mahout LDA - uh oh
To: Robin Anil <ro...@gmail.com>
Cc: Mike Lazarus <mi...@yahoo.com>


Hi Robin,

I have run a toy problem on Mahout 0.3 (locally) for LDA that I used to test
Blei's C version of LDA that he posts on his site.

It has an exact solution that the LDA should converge to.  Please see the
attached PDF that describes the intended output.

Is LDA working?  The following output indicates some sort of collapsing
behavior to me.

  Topic 0  Topic 1  Topic 2  Topic 3  Topic 4
  x        w        x        u        x
  u        u        g        j        n
  l        r        i        m        l
  j        q        h        h        p
  v        p        e        i        q
  e        t        f        g        v
  d        s        d        f        o
  b        c        b        n        k
  y        f        c        l        m
  w        v        u        v        u
  c        d        p        y        t
  k        o        l        r        r
  i        b        j        k        j
  f        e        k        e        f
  g        x        y        s        y
  t        y        w        b        w
  h        i        s        p        s
  o        l        v        x        d
  q        j        t        d        i
  n        k        o        t        b

The intended output is (again, please see attached):

  D  I  N  S  X
  d  i  n  s  x
  c  h  m  t  y
  e  j  o  r  w
  b  k  l  u  v
  f  g  p  q  a
  a  f  k  p  b
  g  l  q  v  u
  h  m  j  w  t
  y  u  r  o  c
  n  s  d  d  i
  s  e  x  f  f
  r  q  i  i  n
  m  v  w  c  o
  o  w  u  a  h
  q  n  s  h  g
  p  t  c  x  d
  t  x  f  e  l
  x  d  e  j  s
  w  y  g  b  j
  i  r  y  n  r
  u  o  h  y  m
  k  b  t  l  e
  v  c  a  m  k
  j  a  b  g  p
  l  p  v  k  q

What tests do you run to make sure the output is correct?
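
One rough check on the two listings in this message is to compare the
top-ranked word of each topic. The sketch below assumes each column is a topic
and the first row holds that topic's highest-ranked word; the function names
are hypothetical and are not part of Mahout or of Blei's code. It only flags
repeated top words. It is not a full convergence test.

```python
from collections import Counter

# First row of each listing in this message (assumed to be the top word per topic).
observed_top = ["x", "w", "x", "u", "x"]  # Mahout 0.3 run, Topic 0 to Topic 4
intended_top = ["d", "i", "n", "s", "x"]  # intended output, topics D, I, N, S, X


def repeated_top_words(top_words):
    """Return {word: count} for words that rank first in more than one topic."""
    counts = Counter(top_words)
    return {word: n for word, n in counts.items() if n > 1}


def shared_top_words(observed, intended):
    """Return the top words that occur in both listings, ignoring topic order."""
    return set(observed) & set(intended)


if __name__ == "__main__":
    print(repeated_top_words(observed_top))
    print(repeated_top_words(intended_top))
    print(shared_top_words(observed_top, intended_top))
```

Topic order is ignored because LDA topic indices are arbitrary, so the
comparison uses sets rather than positions.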

Thanks,
Mike.



------------------------------
*From:* Robin Anil <ro...@gmail.com>
*To:* Mike Lazarus <mi...@yahoo.com>
*Sent:* Thu, May 20, 2010 9:34:50 AM
*Subject:* Re: Mahout LDA

Sorry about that. It was meant only as a tool for quickly transforming
Reuters into seqfiles on the local disk. On a Hadoop system, creating input
for seq2sparse usually involves writing a simple Map/Reduce job that reads
your native document format and outputs it as SequenceFile<Text, Text>.

Robin


On Thu, May 20, 2010 at 9:58 PM, Robin Anil <ro...@gmail.com> wrote:

> seq directory, I believe, is not tested on HDFS. Try copying the input
> seqfiles and running the rest of the pipeline. Can you also come to the
> Mahout JIRA (bug tracker) and file a ticket?
> https://issues.apache.org/jira/browse/MAHOUT