Posted to dev@lucene.apache.org by Anton Leuski <le...@ict.usc.edu> on 2005/10/08 22:31:15 UTC

Adding information to an index

Greetings,

I'm looking to store some additional information in a Lucene index
and would like advice on how to implement the functionality.
Specifically, I'm planning to store 1) the collection frequency count for
each term, 2) the actual document length for each document (yes, I looked
at the norm factor, and I'm still considering how to adapt it...), 3) the
collection size (total number of terms) for each field, and 4) the
vocabulary size (number of unique terms) for each field. All of this can be
computed on the fly, but I would prefer to generate it at
indexing time and store it somewhere.

I think I figured out how to handle #1) -- I found a post by Doug
Cutting about it which pointed me in the right direction. What should I do
about the rest of the info? I'd like the implementation to
automatically update the counts as documents are added to and deleted
from the index.
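
To make the four statistics and the add/delete bookkeeping concrete, here is a minimal in-memory sketch. It does not use Lucene at all: the class name CollectionStats and every method on it are hypothetical, and each document is assumed to arrive already tokenized as a map from field name to a list of terms. It only illustrates which counters have to move when a document is added or deleted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory tracker for the four statistics; not part of Lucene. */
class CollectionStats {
    // field -> (term -> total occurrences over all live documents): statistic 1
    private final Map<String, Map<String, Long>> collectionFreq = new HashMap<>();
    // field -> total number of term occurrences: statistic 3
    private final Map<String, Long> collectionSize = new HashMap<>();
    // docId -> (field -> tokens); kept so a delete can subtract what the add contributed
    private final Map<Integer, Map<String, List<String>>> docs = new HashMap<>();

    void addDocument(int docId, Map<String, List<String>> fields) {
        if (docs.containsKey(docId)) {
            throw new IllegalArgumentException("duplicate document id: " + docId);
        }
        docs.put(docId, fields);
        apply(fields, +1);
    }

    void deleteDocument(int docId) {
        Map<String, List<String>> fields = docs.remove(docId);
        if (fields != null) {
            apply(fields, -1);
        }
    }

    // Adds (delta = +1) or subtracts (delta = -1) one document's counts.
    private void apply(Map<String, List<String>> fields, int delta) {
        for (Map.Entry<String, List<String>> e : fields.entrySet()) {
            Map<String, Long> freqs =
                collectionFreq.computeIfAbsent(e.getKey(), f -> new HashMap<>());
            for (String term : e.getValue()) {
                // Returning null from the remapping function removes the term,
                // so the map size stays equal to the vocabulary size.
                freqs.merge(term, (long) delta,
                    (old, d) -> old + d == 0 ? null : old + d);
            }
            collectionSize.merge(e.getKey(), (long) delta * e.getValue().size(), Long::sum);
        }
    }

    /** Statistic 1: total occurrences of a term in a field. */
    long collectionFrequency(String field, String term) {
        Map<String, Long> freqs = collectionFreq.get(field);
        return freqs == null ? 0L : freqs.getOrDefault(term, 0L);
    }

    /** Statistic 2: number of tokens a document has in a field. */
    int documentLength(int docId, String field) {
        Map<String, List<String>> fields = docs.get(docId);
        List<String> tokens = fields == null ? null : fields.get(field);
        return tokens == null ? 0 : tokens.size();
    }

    /** Statistic 3: total number of term occurrences in a field. */
    long collectionSize(String field) {
        return collectionSize.getOrDefault(field, 0L);
    }

    /** Statistic 4: number of distinct terms in a field. */
    int vocabularySize(String field) {
        Map<String, Long> freqs = collectionFreq.get(field);
        return freqs == null ? 0 : freqs.size();
    }
}
```

For example, adding a document whose "body" field is [a, b, a] and another whose "body" field is [b, c] gives a collection size of 5 and a vocabulary size of 3 for "body"; deleting the first document brings these down to 2 and 2. Keeping every document's tokens in memory is only a convenience for the sketch; a real index would need some other way to recover a deleted document's counts.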

Thank you.

-- Anton



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Adding information to an index

Posted by Chris Hostetter <ho...@fucit.org>.
: I'm looking to store some additional information in a Lucene index
: and I'm looking for advice on how to implement the functionality.
: Specifically, I'm planning to store 1) collection frequency count for
: each term, 2) actual document length for each document (yes, I looked
: at the norm factor, I'm still considering how to adapt it...) 3)
: collection size (total number of terms) for each field 4) vocabulary
: size (number of unique terms) for each field. All this info can be
: computed on the fly, but I would prefer to generate it at the
: indexing time and store somewhere.

Unless I'm misunderstanding your terminology, it seems like all of this
information is either already stored in the index, or easy to add using
the existing API:


  #1 - Searchable.docFreq(Term):int
  #2 - add as a new field per document.
  #3 & #4 ...

...these are a little trickier.  You can easily get both by iterating over
IndexReader.terms(), but if you specifically want to store the data in the
index, I would first add all of your documents, then use the TermEnum
to compute the information and put it all as stored fields in a single
"metadata" document with no indexed fields (or at least: none in common
with your regular data).

Now you've precomputed everything you want to know, and it's easily
available at query time.
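
The term-enumeration pass for #3 and #4 can be sketched roughly as follows. This is plain Java rather than Lucene code: Posting is a hypothetical stand-in for one entry of the term enumeration together with its per-document frequencies, and the names TermStatsSketch, FieldStats and summarize are likewise made up for illustration. In the Lucene API of this period the per-document frequencies would, I believe, come from TermDocs.freq() for each term; treat that mapping as an assumption. The sketch also shows that a document frequency (how many documents contain a term) and a collection frequency (how many times the term occurs in total) are different numbers.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a single pass over all terms; not Lucene code. */
class TermStatsSketch {

    /** Stand-in for one enumerated term plus its frequency in each document that contains it. */
    static final class Posting {
        final String field;
        final String text;
        final int[] freqs; // one entry per document containing the term

        Posting(String field, String text, int... freqs) {
            this.field = field;
            this.text = text;
            this.freqs = freqs;
        }

        /** Number of documents containing the term. */
        int docFreq() {
            return freqs.length;
        }

        /** Total occurrences of the term over all documents. */
        long collectionFreq() {
            return Arrays.stream(freqs).asLongStream().sum();
        }
    }

    /** Per-field totals: statistics 3 and 4 from the original question. */
    static final class FieldStats {
        long vocabularySize; // number of distinct terms in the field
        long collectionSize; // total number of term occurrences in the field
    }

    /** Walks every posting once and accumulates the per-field totals. */
    static Map<String, FieldStats> summarize(List<Posting> postings) {
        Map<String, FieldStats> byField = new LinkedHashMap<>();
        for (Posting p : postings) {
            FieldStats stats = byField.computeIfAbsent(p.field, f -> new FieldStats());
            stats.vocabularySize++;
            stats.collectionSize += p.collectionFreq();
        }
        return byField;
    }
}
```

The resulting per-field numbers are what would then be written as stored fields on the single "metadata" document. Because they are computed after all documents have been added, they would have to be recomputed whenever the index changes.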


-Hoss

