Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/01 01:44:02 UTC

Re: Flexible Indexing (was Re: Lucene Planning)

[wild brainstorming...]

Another reason to consolidate the freqs, positions, and boosts/norms
into one file: we can isolate and distill the code that
encodes/decodes that file into a plugin, weakening the current tight
coupling between Lucene and its file format.  Changing that index
format might then be a little less painful, as we'd just write a new
plugin but leave the old one sitting there.  We may not be able to
write plugin code for an entire index, but we can write some for
each file.

I'm imagining a PostingsWriter interface that each plugin would  
implement, then a complementary PostingsReader.  PostingsReader would  
look a lot like TermPositions does now, but would add getBoost().  To  
this, a POSPostingsReader subclass might add getPartOfSpeech().
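
A rough sketch of how the reader side might look.  Only the names
PostingsReader, POSPostingsReader, getBoost() and getPartOfSpeech()
come from the paragraph above; every other method, DocPostings, and
the in-memory implementation are assumptions made up for illustration,
standing in for code that would really decode a postings file.

```java
import java.util.List;

// Hypothetical interfaces: the method set beyond getBoost() and
// getPartOfSpeech() is an assumption, loosely modeled on TermPositions.
interface PostingsReader {
    boolean next();      // advance to the next document; false when exhausted
    int doc();           // current document number
    int freq();          // number of positions in the current document
    int nextPosition();  // advance to and return the next position
    float getBoost();    // boost stored with the current position
}

interface POSPostingsReader extends PostingsReader {
    String getPartOfSpeech();  // tag stored with the current position
}

// One document's postings, held in parallel arrays (illustrative only).
final class DocPostings {
    final int doc;
    final int[] positions;
    final float[] boosts;
    final String[] tags;

    DocPostings(int doc, int[] positions, float[] boosts, String[] tags) {
        this.doc = doc;
        this.positions = positions;
        this.boosts = boosts;
        this.tags = tags;
    }
}

// In-memory stand-in for a plugin that would decode a real postings file.
final class InMemoryPOSPostingsReader implements POSPostingsReader {
    private final List<DocPostings> docs;
    private int docIndex = -1;
    private int posIndex = -1;

    InMemoryPOSPostingsReader(List<DocPostings> docs) { this.docs = docs; }

    public boolean next() {
        docIndex++;
        posIndex = -1;  // positions restart for each document
        return docIndex < docs.size();
    }
    public int doc() { return docs.get(docIndex).doc; }
    public int freq() { return docs.get(docIndex).positions.length; }
    public int nextPosition() { return docs.get(docIndex).positions[++posIndex]; }
    public float getBoost() { return docs.get(docIndex).boosts[posIndex]; }
    public String getPartOfSpeech() { return docs.get(docIndex).tags[posIndex]; }
}
```

Code that only scores would be written against PostingsReader and
never see the part-of-speech accessor.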

In addition to the postings file, we might want a stored fields file  
plugin.  Maybe call those interfaces DBWriter and DBReader.  This is  
trickier, because stored fields are not inverted, so if we used  
different codecs for each field, their output would have to be  
interleaved.  Bleah.  Seems more like we'd want to use a plugin for  
the entire file, with a limited selection of per-field options.

Each segment would have a file recording which codecs were in use.   
Each field name, once associated with a codec, could not be modified  
to use another.  No more reconciliation of indexed/notIndexed,  
omitNorms/notOmitNorms.
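
To make the "once associated, never modified" rule concrete, here is a
minimal sketch.  SegmentCodecs, bind() and codecFor() are hypothetical
names, and identifying a codec by a plain string is an assumption; the
per-segment file itself is not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-segment record of which codec each field uses.
final class SegmentCodecs {
    private final Map<String, String> codecByField = new LinkedHashMap<>();

    // Associates a field with a codec.  Binding the same codec again is a
    // no-op; binding a different codec is rejected.
    void bind(String field, String codecName) {
        String existing = codecByField.putIfAbsent(field, codecName);
        if (existing != null && !existing.equals(codecName)) {
            throw new IllegalStateException(
                "field '" + field + "' already uses codec '" + existing + "'");
        }
    }

    // Returns the codec bound to the field, or null if none is bound yet.
    String codecFor(String field) {
        return codecByField.get(field);
    }
}
```

Rejecting the second binding up front is what would remove the need to
reconcile conflicting per-field settings later.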

Does it make sense then to have the Term Dictionary as a plugin?  I  
think so.  But maybe rather than ordering all terms first by field  
name then by term text, each indexed field should have its own  
dictionary file, ordered by term text.  Then the dictionary file  
could have per-field customization as well.

The point of this exercise is to generalize the high level data  
structures required by an inverted indexing engine.

   * Term Dictionary
   * Postings
   * Stored Fields Database
   * Term Vectors (optional)

In my view, each of these should have its own pluggable codec.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Flexible Indexing (was Re: Lucene Planning)

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
> I thought it was you, but wasn't sure.

I'm always looking for ways to minimize Term Vectors, because I
consider excerpting/highlighting a core feature rather than an
add-on, and they seem like such overkill.  It bothers me that they
duplicate so much information.

I've been toying with the idea of a hitCollector.collect(int docNum,  
float score, ScorePositions[] scorePositions) method -- or, more  
likely, a hitCollector.collect(Scorer scorer) method -- that would  
preserve each position that contributed to the score of a document  
and how much it contributed, allowing that information to be passed  
through a Hit object to the Highlighter.
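
A sketch of the first variant, under loud assumptions: the message
only names the collect() parameters, so ScorePosition (singular, one
entry per contributing position), the Hit fields, and
HitListCollector are all invented here for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one position that contributed to a document's score.
final class ScorePosition {
    final int position;
    final float contribution;

    ScorePosition(int position, float contribution) {
        this.position = position;
        this.contribution = contribution;
    }
}

// Hypothetical collector interface carrying positions along with the score.
interface PositionAwareHitCollector {
    void collect(int docNum, float score, ScorePosition[] scorePositions);
}

// Hypothetical Hit that keeps the positions for a later highlighting pass.
final class Hit {
    final int docNum;
    final float score;
    final ScorePosition[] scorePositions;

    Hit(int docNum, float score, ScorePosition[] scorePositions) {
        this.docNum = docNum;
        this.score = score;
        this.scorePositions = scorePositions;
    }
}

// Collects every hit in arrival order without ranking or truncation.
final class HitListCollector implements PositionAwareHitCollector {
    final List<Hit> hits = new ArrayList<>();

    public void collect(int docNum, float score, ScorePosition[] scorePositions) {
        hits.add(new Hit(docNum, score, scorePositions));
    }
}
```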

That might be complemented by storing the startOffsets and endOffsets
for each field as streams of delta-encoded VInts along with the
stored field data.  Conceptually, it would be even cleaner to keep
startOffsets and endOffsets in the postings...

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

e. <doc, freq, <position, boost, startOffset, endOffset>+ >+

... and pass *everything* the Highlighter needs to the Hit object.   
However, the offsets are never needed for scoring.
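
For the offsets-beside-stored-fields variant, a minimal sketch of what
"delta-encoded VInts" could mean in practice.  The layout is an
assumption: each start is written as a delta from the previous start,
and each end as a length relative to its own start, so starts must be
non-decreasing and every end must be >= its start.  OffsetStreamCodec
and its method names are hypothetical.  The VInt scheme is the usual
one: seven bits per byte, low-order bits first, high bit set on every
byte except the last.

```java
import java.io.ByteArrayOutputStream;

final class OffsetStreamCodec {
    // Writes a non-negative int as a VInt.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // continuation bit set
            value >>>= 7;
        }
        out.write(value);
    }

    // Encodes parallel start/end offset arrays into one byte stream.
    static byte[] encode(int[] startOffsets, int[] endOffsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previousStart = 0;
        for (int i = 0; i < startOffsets.length; i++) {
            writeVInt(out, startOffsets[i] - previousStart);    // start delta
            writeVInt(out, endOffsets[i] - startOffsets[i]);    // token length
            previousStart = startOffsets[i];
        }
        return out.toByteArray();
    }

    // Decodes `count` offset pairs; returns {startOffsets, endOffsets}.
    static int[][] decode(byte[] bytes, int count) {
        int[] starts = new int[count];
        int[] ends = new int[count];
        int cursor = 0;
        int previousStart = 0;
        for (int i = 0; i < count; i++) {
            int[] pair = new int[2];
            for (int j = 0; j < 2; j++) {
                int b = bytes[cursor++];
                int value = b & 0x7F;
                for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                    b = bytes[cursor++];
                    value |= (b & 0x7F) << shift;
                }
                pair[j] = value;
            }
            starts[i] = previousStart + pair[0];
            ends[i] = starts[i] + pair[1];
            previousStart = starts[i];
        }
        return new int[][] {starts, ends};
    }
}
```

Because the deltas and lengths are usually small, most of them fit in
a single byte each.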

> I would also like a way to store the frequency of the term in the  
> overall collection (probably should go in the Term dictionary, but  
> not sure, at the cost of an additional VInt per term, but I am open  
> to other places to store it).  Right now, in order to calculate  
> this, one has to either store it separately at indexing time (using  
> a term counting Filter) or calculate it at runtime by looping over  
> the TermDocs and summing.

Sure, makes sense to me.  Sounds like a custom codec you'd define.   
(The following code has been swiped and adapted from TermBuffer...)

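import java.io.IOException;
import org.apache.lucene.store.IndexInput;

public class CollFreqCodec extends TermDictionaryCodec {
   private int collFreq;   // frequency of the term across the collection

   // Reads one record: shared-prefix length, suffix length, suffix
   // bytes, field number, then the extra collFreq VInt.
   public void readRecord (IndexInput input, FieldInfos fieldInfos)
     throws IOException {
     this.term = null;                           // invalidate cache
     int start = input.readVInt();               // bytes shared with previous term
     int length = input.readVInt();              // bytes that follow in this record
     int totalLength = start + length;
     setBytesLength(totalLength);
     input.readBytes(this.bytes, start, length);
     this.field = fieldInfos.fieldName(input.readVInt());
     this.collFreq = input.readVInt();
   }
}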

That's not quite right, because I'm envisioning a codec rather than a  
TermBuffer subclass, but maybe you get the idea.
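
For anyone who wants to poke at the record layout outside of Lucene,
here is a self-contained round trip using only java.io.  Everything in
it is an assumption made for illustration: the class name
CollFreqRecordCodec, the writer side, treating term text as UTF-8
bytes, and hand-rolled VInt helpers standing in for IndexInput and
IndexOutput.  The field number is kept as a bare int because there is
no FieldInfos here.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CollFreqRecordCodec {
    byte[] bytes = new byte[0];  // current term text; the prefix is reused
    int fieldNum;
    int collFreq;

    static void writeVInt(DataOutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }

    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.readByte();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Writes one record, sharing the leading bytes of the previous term.
    static void writeRecord(DataOutputStream out, byte[] previous, byte[] term,
                            int fieldNum, int collFreq) throws IOException {
        int start = 0;
        int max = Math.min(previous.length, term.length);
        while (start < max && previous[start] == term[start]) {
            start++;
        }
        writeVInt(out, start);                        // shared prefix length
        writeVInt(out, term.length - start);          // suffix length
        out.write(term, start, term.length - start);  // suffix bytes
        writeVInt(out, fieldNum);
        writeVInt(out, collFreq);                     // the one extra VInt per term
    }

    // Mirrors readRecord above: keeps the first `start` bytes from the
    // previous term and overwrites the rest with the new suffix.
    void readRecord(DataInputStream in) throws IOException {
        int start = readVInt(in);
        int length = readVInt(in);
        bytes = Arrays.copyOf(bytes, start + length);
        in.readFully(bytes, start, length);
        fieldNum = readVInt(in);
        collFreq = readVInt(in);
    }

    String text() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```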

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Flexible Indexing (was Re: Lucene Planning)

Posted by Grant Ingersoll <gs...@syr.edu>.

I thought it was you, but wasn't sure.

I would also like a way to store the frequency of the term in the 
overall collection (probably should go in the Term dictionary, but not 
sure, at the cost of an additional VInt per term, but I am open to other 
places to store it).  Right now, in order to calculate this, one has to 
either store it separately at indexing time (using a term counting 
Filter) or calculate it at runtime by looping over the TermDocs and 
summing. 

Marvin Humphrey wrote:
>
> On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:
>
>> Someone on the list a while ago suggested moving Term Vectors out of 
>> the postings and storing them separately, as then they don't have to 
>> be merged (but the doc ids would have to be kept up to date)
>
> Yes, that was me.  :)  I suggested storing  TermVector data alongside 
> stored field data, in the .fdt file.  That's what KinoSearch does 
> right now.  It cuts down on disk seeks.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 


Re: Flexible Indexing (was Re: Lucene Planning)

Posted by Marvin Humphrey <ma...@rectangular.com>.

On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:

> Someone on the list a while ago suggested moving Term Vectors out
> of the postings and storing them separately, as then they don't
> have to be merged (but the doc ids would have to be kept up to date)

Yes, that was me.  :)  I suggested storing  TermVector data alongside  
stored field data, in the .fdt file.  That's what KinoSearch does  
right now.  It cuts down on disk seeks.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: Flexible Indexing (was Re: Lucene Planning)

Posted by Grant Ingersoll <gs...@syr.edu>.


Marvin Humphrey wrote:
>   * Term Vectors (optional)

Someone on the list a while ago suggested moving Term Vectors out of the 
postings and storing them separately, as then they don't have to be 
merged (but the doc ids would have to be kept up to date).

-- 

Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
335 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 

