Posted to user@mahout.apache.org by Ian Upright <ia...@upright.net> on 2011/07/20 21:52:51 UTC

Yahoo LDA

Hi,

This is a little off topic, but perhaps someone on this list may be able to
comment.

I'm still fairly new to LDA, and I've been playing with Yahoo's LDA
implementation.

The Yahoo code produces a file called:

lda.worToTop.txt

www.teddybears.com/	recreation/toys	(teddy,15) (bears,15) (enjoy,2)
(teddy,15) (bears,15) (enjoy,2) (featuring,41) (teddy,15)
www.bearsbythesea.com/	recreation/toys	(teddy,99) (bear,99) (store,81)
(pismo,30) (beach,88) (california,24) (specialize,99) (muffy,99) (store,11)
(complete,11) (collections,46) (checkout,84) (web,87) 

So this shows that teddy is in topic 15 and in topic 99.

However, what I thought I would be looking for is, for each word, a vector
of probabilities over the topics.  (e.g. with 600 topics, each word would
have a 600-dimensional vector giving how strongly that word belongs to each
of those 600 topics)

This vector could then be used for calculating similarity against other
words, etc.  Is that the correct idea?
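
Just to illustrate what I mean, here is a toy, untested sketch: two words,
each represented as a made-up probability vector over 6 topics instead of
600, compared with cosine similarity.  The numbers are invented purely for
illustration.

import math

# hypothetical p(topic | word) vectors, invented purely for illustration
teddy = [0.0, 0.6, 0.0, 0.3, 0.1, 0.0]
bear  = [0.0, 0.5, 0.1, 0.3, 0.1, 0.0]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(teddy, bear))  # closer to 1.0 means more similar topic usage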

If so, then using the Yahoo LDA output, do I have to calculate that vector
of probabilities myself for each unique word, using the above file?  Perhaps
I'm missing something?
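
To make that concrete, here is a rough, untested sketch of what I think I
would have to do myself: scan lda.worToTop.txt, count how often each word
is assigned to each topic, and normalize the counts into a per-word vector.
The line format is guessed from the sample above, and raw assignment counts
are of course only an approximation of the model's smoothed word-topic
probabilities.

import re
from collections import defaultdict

NUM_TOPICS = 600  # whatever the model was actually trained with
pair = re.compile(r"\(([^,()\s]+),(\d+)\)")  # matches (word,topic)

counts = defaultdict(lambda: [0] * NUM_TOPICS)
with open("lda.worToTop.txt") as f:
    for line in f:
        # each line: <doc-id> <tab> <label> <tab> (word,topic) (word,topic) ...
        for word, topic in pair.findall(line):
            counts[word][int(topic)] += 1

# normalize each word's counts into a probability vector
vectors = {}
for word, cs in counts.items():
    total = sum(cs)
    vectors[word] = [c / total for c in cs] if total else cs

print(vectors.get("teddy"))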

Thanks, Ian

Re: Yahoo LDA

Posted by Hector Yee <he...@gmail.com>.
The top few coefficients are in lda.topToWor.txt
The rest of it is probably in lda.top


-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)