Posted to java-user@lucene.apache.org by ext <ex...@prodigy.teamon.com> on 2003/02/14 15:10:44 UTC

Quickest way to build a Document - (Keyword, Freq)* map

Hi, 

I am using Lucene right now to index several semi-structured documents. I 
recently had to implement a method 'getFrequencyVector()' that simply returns 
a mapping of keyword -> frequency from the information already in the 
Lucene index. 

I currently maintain the Lucene index on the basis of the keyword -> (document, 
freq)* mapping. The best solution I could come up with is to iterate over 
all the keywords ( :( ), match my own document identifier, and build the 
vector. Any ideas/suggestions? 
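
For reference, my current brute-force approach looks roughly like this (a 
simplified, untested sketch; it assumes I have already resolved my own 
document identifier to Lucene's internal document number, docNr): 

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;

    public class FrequencyVector {

        // Walks every term in the index and picks out the posting for one
        // document: O(|k| * |d|) overall, as described below.
        public static Map getFrequencyVector(IndexReader reader, int docNr)
                throws IOException {
            Map vector = new HashMap();        // keyword -> Integer frequency
            TermEnum terms = reader.terms();   // enumerate all indexed terms
            try {
                while (terms.next()) {
                    Term term = terms.term();
                    TermDocs termDocs = reader.termDocs(term);
                    try {
                        // linear scan of this term's postings
                        while (termDocs.next()) {
                            if (termDocs.doc() == docNr) {
                                vector.put(term.text(), new Integer(termDocs.freq()));
                                break;
                            }
                        }
                    } finally {
                        termDocs.close();
                    }
                }
            } finally {
                terms.close();
            }
            return vector;
        }
    }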

Is there a way to speed up the vector computation? It currently takes 
O(|k| * |d|) time, where |k| is the total number of keywords indexed and |d| is 
the average number of documents a keyword occurs in. 

Ideally, I would like a forward index mapping each document to its 
(keyword, frequency) pairs for this application. Thank you in advance for your 
expertise and your time. 

Cheers, 
Santosh Dawara 
Graduate Student 
Rochester Instt of Tech




Re: Quickest way to build a Document - (Keyword, Freq)* map

Posted by Ype Kingma <yk...@xs4all.nl>.
On Friday 14 February 2003 15:10, you wrote:
> Hi,
>
> I am using Lucene right now to index several semi-structured documents. I
> recently had to implement a method 'getFrequencyVector()' that simply returns
> a mapping of keyword -> frequency from the information already in the
> Lucene index.
>
> I currently maintain the Lucene index on the basis of the keyword -> (document,
> freq)* mapping. The best solution I could come up with is to iterate over
> all the keywords ( :( ), match my own document identifier, and build the
> vector. Any ideas/suggestions?
>
> Is there a way to speed up the vector computation? It currently takes
> O(|k| * |d|) time, where |k| is the total number of keywords indexed and |d| is
> the average number of documents a keyword occurs in.

The Javadoc for TermDocs.skipTo(docNr) mentions: "Some implementations are 
considerably more efficient than that", i.e. more efficient than a linear scan. 
With the current index format of Lucene, though, this is not possible at less 
than O(|d|) cost, because the term frequency (.frq) file is only indexed from 
the term file, i.e. by term, and not by (term, document nr) pairs. Still, 
skipTo() might be useful for terms occurring in a larger portion of the documents.

So, if I understand this correctly, skipTo(docNr) might help performance in practice,
but as far as I know you cannot currently do better than O(|d|) per term.
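
For illustration, something like the following (an untested sketch against the
current TermDocs API, reusing the names from the sketch in your mail) would
replace the inner linear scan:

    // Untested sketch: let skipTo() position the postings at or beyond docNr
    // instead of stepping through them one entry at a time with next().
    TermDocs termDocs = reader.termDocs(term);
    try {
        // skipTo() returns true if there is an entry with doc() >= docNr;
        // the document contains the term only if the numbers match exactly.
        if (termDocs.skipTo(docNr) && termDocs.doc() == docNr) {
            vector.put(term.text(), new Integer(termDocs.freq()));
        }
    } finally {
        termDocs.close();
    }

As said above, with the current file format this is still O(|d|) per term in the
worst case, so I would measure whether it actually helps.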

Kind regards,
Ype
