Posted to user@mahout.apache.org by Ian Upright <ia...@upright.net> on 2011/07/20 21:52:51 UTC
Yahoo LDA
Hi,
This is a little off topic, but perhaps someone on this list may be able to
comment.
I'm still fairly new to LDA, and I've been playing with Yahoo's LDA
implementation.
The Yahoo code produces a file called:
lda.worToTop.txt
www.teddybears.com/ recreation/toys (teddy,15) (bears,15) (enjoy,2)
(teddy,15) (bears,15) (enjoy,2) (featuring,41) (teddy,15)
www.bearsbythesea.com/ recreation/toys (teddy,99) (bear,99) (store,81)
(pismo,30) (beach,88) (california,24) (specialize,99) (muffy,99) (store,11)
(complete,11) (collections,46) (checkout,84) (web,87)
So this shows that teddy is in topic 15 and in topic 99.
However, what I thought I would be looking for is a vector in which each
word is given a probability for every topic (e.g., with 600 topics, each
word would map to a 600-element vector, one entry per topic).
This vector could then be used for calculating similarity against other
words, etc. Is that the correct idea?
If so, using the Yahoo LDA output, do I have to calculate that vector and
its probabilities myself for each unique word, using the above file?
Perhaps I'm missing something?
Thanks, Ian
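A minimal sketch of the by-hand calculation asked about above, in Python. It
assumes that the (word,topic) pairs in lda.worToTop.txt are per-token topic
assignments, as the sample suggests, and that counting them per word and
normalizing gives a usable estimate of that word's distribution over topics.
The function names, the 600-topic vector size, and the 0-based topic ids are
illustrative assumptions, not part of the Yahoo LDA code.

```python
import math
import re
from collections import Counter, defaultdict

# Matches one "(word,topic)" pair in the format shown in the sample above.
PAIR = re.compile(r"\(([^,()\s]+),(\d+)\)")

# The sample lines from the post, used here as stand-in input.
SAMPLE = """\
www.teddybears.com/ recreation/toys (teddy,15) (bears,15) (enjoy,2)
(teddy,15) (bears,15) (enjoy,2) (featuring,41) (teddy,15)
www.bearsbythesea.com/ recreation/toys (teddy,99) (bear,99) (store,81)
(pismo,30) (beach,88) (california,24) (specialize,99) (muffy,99) (store,11)
(complete,11) (collections,46) (checkout,84) (web,87)
"""


def count_assignments(text):
    """Count how many times each word is assigned to each topic."""
    counts = defaultdict(Counter)
    for word, topic in PAIR.findall(text):
        counts[word][int(topic)] += 1
    return counts


def topic_vector(counts, word, num_topics, smoothing=0.0):
    """Estimate p(topic | word) as a list of length num_topics.

    Assumes topic ids are integers in [0, num_topics). An optional
    additive smoothing term avoids exact zeros.
    """
    row = counts.get(word, {})
    total = sum(row.values()) + smoothing * num_topics
    if total == 0:
        return [0.0] * num_topics
    return [(row.get(k, 0) + smoothing) / total for k in range(num_topics)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


counts = count_assignments(SAMPLE)
teddy = topic_vector(counts, "teddy", num_topics=600)
bears = topic_vector(counts, "bears", num_topics=600)
print(teddy[15], teddy[99], cosine(teddy, bears))
```

On a real run the whole file would be read in place of SAMPLE. Words that
occur only a few times get very sparse vectors, so a small smoothing value
may be worth trying before comparing them.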
Re: Yahoo LDA
Posted by Hector Yee <he...@gmail.com>.
The top few coefficients are in lda.topToWor.txt
The rest of it is probably in lda.top
--
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)