Posted to user@mahout.apache.org by Jack Tanner <ih...@hotmail.com> on 2009/09/23 22:58:13 UTC

LDA and utils.vectors.TermEntry

The TermEntry constructor is (String term, int termIdx, int docFreq). What's the point of termIdx? I see that it gets used for an assert in LDAPrintTopics.java:readDictionary(), but it seems redundant otherwise.
(Background: I'd like to generate vectors for LDA directly, bypassing Lucene. Following o.a.m.utils.vectors.lucene.Driver, I see that I need to generate a dictionary file for the "printing out top terms per topic" step. This uses TermInfo, which contains lots of TermEntry elements.)
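
A minimal sketch of building such a dictionary without Lucene, assuming the documents are already tokenized. DictionarySketch and Entry are hypothetical stand-ins (not Mahout's TermInfo/TermEntry), and the on-disk dictionary format that LDAPrintTopics reads is deliberately not shown:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionarySketch {

    /** Hypothetical stand-in for TermEntry: term text, vector position, document frequency. */
    public static final class Entry {
        public final String term;
        public final int termIdx;
        public int docFreq;

        public Entry(String term, int termIdx, int docFreq) {
            this.term = term;
            this.termIdx = termIdx;
            this.docFreq = docFreq;
        }
    }

    /** Gives each distinct term the next free index and counts the documents it occurs in. */
    public static Map<String, Entry> build(List<List<String>> docs) {
        Map<String, Entry> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            Set<String> notYetCounted = new HashSet<>(doc);
            for (String term : doc) {
                if (!notYetCounted.remove(term)) {
                    continue; // term already counted for this document
                }
                Entry e = dict.get(term);
                if (e == null) {
                    // First time this term is seen anywhere: index = current dictionary size.
                    dict.put(term, new Entry(term, dict.size(), 1));
                } else {
                    e.docFreq++;
                }
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("topic", "model", "topic"),
            Arrays.asList("model", "vector"));
        for (Entry e : build(docs).values()) {
            System.out.println(e.termIdx + "\t" + e.term + "\t" + e.docFreq);
        }
    }
}
```

In this sketch termIdx is the only link between a term and its column in the document vectors, which is why it has to be stored alongside the term.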

Re: LDA and utils.vectors.TermEntry

Posted by Sean Owen <sr...@gmail.com>.
FWIW, Grant, I only see it used in two places:

TFDFMapper.map() where it's used as an index into a vector
JWriterTermInfoWriter.write() where it is merely output, not really used
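
To illustrate the first case, here is a rough sketch of what "used as an index into a vector" means. This is not the TFDFMapper code: a plain double[] stands in for a Mahout Vector, the dictionary is a hypothetical map from term to {termIdx, docFreq}, and the tf * log(numDocs / docFreq) weighting is an assumption chosen for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermIndexVectorSketch {

    /**
     * Builds a dense weight vector for one document.
     * dict maps term -> {termIdx, docFreq}; terms missing from dict are skipped.
     */
    public static double[] vectorize(List<String> tokens, Map<String, int[]> dict, int numDocs) {
        double[] vec = new double[dict.size()];
        for (String token : tokens) {
            int[] info = dict.get(token);
            if (info == null) {
                continue; // out-of-dictionary term
            }
            int termIdx = info[0];
            int docFreq = info[1];
            // Assumed weighting: each occurrence adds one idf unit, so the total is tf * idf.
            vec[termIdx] += Math.log((double) numDocs / docFreq);
        }
        return vec;
    }

    public static void main(String[] args) {
        Map<String, int[]> dict = new HashMap<>();
        dict.put("foo", new int[] {0, 1});
        dict.put("bar", new int[] {1, 2});
        double[] vec = vectorize(Arrays.asList("foo", "foo", "bar"), dict, 2);
        System.out.println(Arrays.toString(vec));
    }
}
```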

On Wed, Sep 23, 2009 at 4:32 PM, Grant Ingersoll <gs...@apache.org> wrote:
> The term entries are used to map the text to a position in the Vector.  So,
> the readDictionary is just loading up that mapping such that when it
> examines the vector it can print out that term 14534 is really "foobar", or
> whatever.
>
> There may be an abstraction to be made here, but I'd have to dig a little
> deeper into the code to say for sure.

Re: LDA and utils.vectors.TermEntry

Posted by Grant Ingersoll <gs...@apache.org>.
The term entries map each term's text to a position in the Vector.
So readDictionary() is just loading up that mapping, so that when it
examines the vector it can print out that term 14534 is really
"foobar", or whatever.
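
A rough sketch of that reverse lookup, for illustration only. It is not the LDAPrintTopics implementation: the class and method names are hypothetical, the topic is a plain double[] of per-term weights, and the term indices are assumed to be dense (0 to n-1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopTermsSketch {

    /** Inverts term -> index into an array where position i holds the term with index i. */
    public static String[] invert(Map<String, Integer> termToIdx) {
        String[] terms = new String[termToIdx.size()];
        for (Map.Entry<String, Integer> e : termToIdx.entrySet()) {
            terms[e.getValue()] = e.getKey();
        }
        return terms;
    }

    /** Returns the k terms with the largest weights, highest first. */
    public static List<String> topTerms(double[] weights, String[] terms, int k) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort vector positions by descending weight.
        Arrays.sort(order, (a, b) -> Double.compare(weights[b], weights[a]));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, order.length); i++) {
            top.add(terms[order[i]]); // vector position -> human-readable term
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = new HashMap<>();
        dict.put("foobar", 0);
        dict.put("lucene", 1);
        dict.put("topic", 2);
        double[] topicWeights = {0.1, 0.7, 0.2};
        System.out.println(topTerms(topicWeights, invert(dict), 2));
    }
}
```

Without the stored index, nothing ties position i in a topic's weight vector back to a term, so the dictionary has to be written with the same indices that were used when the vectors were built.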

There may be an abstraction to be made here, but I'd have to dig a  
little deeper into the code to say for sure.



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
