Posted to dev@lucene.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2006/06/01 01:44:02 UTC
Re: Flexible Indexing (was Re: Lucene Planning)
[wild brainstorming...]
Another reason to consolidate the freqs, positions, and boosts/norms
into one file: we can isolate and distill the code that encodes/
decodes that file into a plugin, weakening the current tight coupling
between Lucene and its file format. Changing that index format might
then be a little less painful, as we'd just write a new plugin but
leave the old one sitting there. We may not be able to write plugin
code for an entire index, but we can write some for each file.
I'm imagining a PostingsWriter interface that each plugin would
implement, then a complementary PostingsReader. PostingsReader would
look a lot like TermPositions does now, but would add getBoost(). To
this, a POSPostingsReader subclass might add getPartOfSpeech().
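A rough sketch of what that pair of interfaces might look like follows. None of these types exist in Lucene. Apart from getBoost(), the method names are guesses modeled on TermPositions, and the in-memory InMemoryPostings class is a hypothetical stand-in that exists only so the sketch can run without a real file format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical read side of a postings plugin, loosely modeled on TermPositions.
interface PostingsReader {
    boolean next();      // advance to the next document; false when exhausted
    int doc();           // current document number
    int freq();          // number of positions in the current document
    int nextPosition();  // next position within the current document
    float getBoost();    // boost of the position last returned by nextPosition()
}

// Hypothetical write side of the same plugin.
interface PostingsWriter {
    void startDoc(int doc);
    void addPosition(int position, float boost);
}

// Toy plugin that keeps postings in memory instead of encoding them to a file.
final class InMemoryPostings implements PostingsWriter {
    private final List<Integer> docs = new ArrayList<Integer>();
    private final List<List<Integer>> positions = new ArrayList<List<Integer>>();
    private final List<List<Float>> boosts = new ArrayList<List<Float>>();

    public void startDoc(int doc) {
        docs.add(doc);
        positions.add(new ArrayList<Integer>());
        boosts.add(new ArrayList<Float>());
    }

    public void addPosition(int position, float boost) {
        positions.get(positions.size() - 1).add(position);
        boosts.get(boosts.size() - 1).add(boost);
    }

    // Returns a reader over everything written so far.
    PostingsReader reader() {
        return new PostingsReader() {
            private int docIdx = -1;
            private int posIdx = -1;
            public boolean next() { posIdx = -1; return ++docIdx < docs.size(); }
            public int doc() { return docs.get(docIdx); }
            public int freq() { return positions.get(docIdx).size(); }
            public int nextPosition() { return positions.get(docIdx).get(++posIdx); }
            public float getBoost() { return boosts.get(docIdx).get(posIdx); }
        };
    }
}
```

A POSPostingsReader would then be a sub-interface adding getPartOfSpeech(), with a matching writer that accepts the tag alongside each position.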
In addition to the postings file, we might want a stored fields file
plugin. Maybe call those interfaces DBWriter and DBReader. This is
trickier, because stored fields are not inverted, so if we used
different codecs for each field, their output would have to be
interleaved. Bleah. Seems more like we'd want to use a plugin for
the entire file, with a limited selection of per-field options.
Each segment would have a file recording which codecs were in use.
Each field name, once associated with a codec, could not be modified
to use another. No more reconciliation of indexed/notIndexed,
omitNorms/notOmitNorms.
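The "bind once, never change" rule could be enforced by something as small as the sketch below. SegmentCodecs and the codec names passed to it are hypothetical; how the mapping would actually be serialized into the per-segment file is left open here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory view of the per-segment file that records
// which codec each field uses.
final class SegmentCodecs {
    private final Map<String, String> codecByField =
        new LinkedHashMap<String, String>();

    // Binds a field to a codec on first use. Binding the same codec again
    // is a no-op; binding a different one is rejected.
    void bind(String field, String codecName) {
        String existing = codecByField.get(field);
        if (existing == null) {
            codecByField.put(field, codecName);
        } else if (!existing.equals(codecName)) {
            throw new IllegalStateException("field '" + field
                + "' already uses codec '" + existing
                + "', cannot switch to '" + codecName + "'");
        }
    }

    // Returns the codec bound to the field, or null if none has been bound.
    String codecFor(String field) {
        return codecByField.get(field);
    }
}
```

Because a conflicting bind fails immediately, there is nothing left to reconcile at merge time.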
Does it make sense then to have the Term Dictionary as a plugin? I
think so. But maybe rather than ordering all terms first by field
name then by term text, each indexed field should have its own
dictionary file, ordered by term text. Then the dictionary file
could have per-field customization as well.
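As an in-memory analogy for per-field dictionaries, the sketch below keeps one sorted map per field instead of a single list ordered by (field, term text). PerFieldDictionaries is a hypothetical name, the Long value stands in for whatever pointer into the postings a real dictionary entry would carry, and the ordering is simply String's natural ordering.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical: one term dictionary per indexed field, each ordered by term text.
final class PerFieldDictionaries {
    // field name -> (term text -> pointer into that field's postings)
    private final Map<String, TreeMap<String, Long>> byField =
        new HashMap<String, TreeMap<String, Long>>();

    void add(String field, String termText, long postingsPointer) {
        TreeMap<String, Long> dict = byField.get(field);
        if (dict == null) {
            dict = new TreeMap<String, Long>();
            byField.put(field, dict);
        }
        dict.put(termText, postingsPointer);
    }

    // Returns the pointer for the term, or null if the field or term is unknown.
    Long lookup(String field, String termText) {
        TreeMap<String, Long> dict = byField.get(field);
        return dict == null ? null : dict.get(termText);
    }

    // Iterates one field's terms in sorted order without touching other fields.
    Iterable<String> terms(String field) {
        TreeMap<String, Long> dict = byField.get(field);
        if (dict == null) {
            return Collections.<String>emptyList();
        }
        return dict.keySet();
    }
}
```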
The point of this exercise is to generalize the high level data
structures required by an inverted indexing engine.
* Term Dictionary
* Postings
* Stored Fields Database
* Term Vectors (optional)
In my view, each of these should have its own pluggable codec.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Flexible Indexing (was Re: Lucene Planning)
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
> I thought it was you, but wasn't sure.
I'm always looking for ways to minimize Term Vectors, because I
consider excerpting/highlighting a core feature rather than an add-
on, and they seem like such overkill. It bothers me that they
duplicate so much information.
I've been toying with the idea of a hitCollector.collect(int docNum,
float score, ScorePositions[] scorePositions) method -- or, more
likely, a hitCollector.collect(Scorer scorer) method -- that would
preserve each position that contributed to the score of a document
and how much it contributed, allowing that information to be passed
through a Hit object to the Highlighter.
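A minimal sketch of the first variant, under stated assumptions: ScorePosition, HighlightableHit and PositionKeepingCollector are all hypothetical names, and the scorer that would supply the per-position contributions is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one position that contributed to a document's score.
final class ScorePosition {
    final int position;        // token position within the field
    final float contribution;  // how much this position added to the score

    ScorePosition(int position, float contribution) {
        this.position = position;
        this.contribution = contribution;
    }
}

// Hypothetical hit record that carries the positions through to a highlighter.
final class HighlightableHit {
    final int docNum;
    final float score;
    final ScorePosition[] scorePositions;

    HighlightableHit(int docNum, float score, ScorePosition[] scorePositions) {
        this.docNum = docNum;
        this.score = score;
        this.scorePositions = scorePositions;
    }
}

// Hypothetical collector with the three-argument collect() floated above.
final class PositionKeepingCollector {
    final List<HighlightableHit> hits = new ArrayList<HighlightableHit>();

    void collect(int docNum, float score, ScorePosition[] scorePositions) {
        hits.add(new HighlightableHit(docNum, score, scorePositions));
    }
}
```

A highlighter could then rank candidate excerpts by the summed contribution of the positions they contain, rather than re-deriving matches from term vectors.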
That might be complemented by storing the startOffsets and endOffsets
for each field as streams of delta-encoded VInts along with the
stored field data. Conceptually, it would be even cleaner to keep
startOffsets and endOffsets in the postings...
a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+
e. <doc, freq, <position, boost, startOffset, endOffset>+ >+
... and pass *everything* the Highlighter needs to the Hit object.
However, the offsets are never needed for scoring.
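For illustration, here is one way the offsets for a single field could be written as delta-encoded VInts. The VInt routines follow Lucene's variable-length int format (seven data bits per byte, low-order bits first, high bit set when more bytes follow). The layout itself is an assumption made for this sketch: a count, then for each token the startOffset as a delta from the previous startOffset and the endOffset as a delta from its own startOffset, interleaved in one stream. OffsetCodec is a hypothetical name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

// Hypothetical codec for one field's token offsets.
final class OffsetCodec {
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);  // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value);                      // final byte, continuation bit clear
    }

    private static int readByte(ByteArrayInputStream in) {
        int b = in.read();
        if (b < 0) {
            throw new IllegalStateException("truncated VInt stream");
        }
        return b;
    }

    static int readVInt(ByteArrayInputStream in) {
        int b = readByte(in);
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte(in);
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // offsets = {start0, end0, start1, end1, ...}; starts are assumed to be
    // non-decreasing and each end >= its start, so every delta is small and
    // non-negative.
    static byte[] encode(int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, offsets.length / 2);
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i += 2) {
            writeVInt(out, offsets[i] - lastStart);       // start delta
            writeVInt(out, offsets[i + 1] - offsets[i]);  // token length
            lastStart = offsets[i];
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int count = readVInt(in);
        int[] offsets = new int[count * 2];
        int lastStart = 0;
        for (int i = 0; i < offsets.length; i += 2) {
            lastStart += readVInt(in);
            offsets[i] = lastStart;
            offsets[i + 1] = lastStart + readVInt(in);
        }
        return offsets;
    }
}
```

With deltas this small, most values fit in a single byte each, which is the attraction of storing them next to the stored field data.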
> I would also like a way to store the frequency of the term in the
> overall collection (probably should go in the Term dictionary, but
> not sure, at the cost of an additional VInt per term, but I am open
> to other places to store it). Right now, in order to calculate
> this, one has to either store it separately at indexing time (using
> a term counting Filter) or calculate it at runtime by looping over
> the TermDocs and summing.
Sure, makes sense to me. Sounds like a custom codec you'd define.
(The following code has been swiped and adapted from TermBuffer...)
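import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Sketch only: TermDictionaryCodec is hypothetical. The members term,
// bytes, field and setBytesLength() are assumed to be inherited from it,
// mirroring what TermBuffer holds. FieldInfos lives in
// org.apache.lucene.index.
public class CollFreqCodec extends TermDictionaryCodec {
  private int collFreq;  // frequency of the term across the whole collection

  public void readRecord(IndexInput input, FieldInfos fieldInfos)
      throws IOException {
    this.term = null;              // invalidate cache
    int start = input.readVInt();  // length of prefix shared with previous term
    int length = input.readVInt(); // length of the new suffix
    int totalLength = start + length;
    setBytesLength(totalLength);
    input.readBytes(this.bytes, start, length);
    this.field = fieldInfos.fieldName(input.readVInt());
    this.collFreq = input.readVInt();  // the one extra VInt per term
  }
}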
That's not quite right, because I'm envisioning a codec rather than a
TermBuffer subclass, but maybe you get the idea.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible Indexing (was Re: Lucene Planning)
Posted by Grant Ingersoll <gs...@syr.edu>.
I thought it was you, but wasn't sure.
I would also like a way to store the frequency of the term in the
overall collection (probably should go in the Term dictionary, but not
sure, at the cost of an additional VInt per term, but I am open to other
places to store it). Right now, in order to calculate this, one has to
either store it separately at indexing time (using a term counting
Filter) or calculate it at runtime by looping over the TermDocs and
summing.
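The runtime approach amounts to the loop sketched below. So that the example runs without an index, TermDocsLike is a hypothetical stand-in for the two TermDocs methods the loop needs, next() and freq(), and CollectionFrequency.over() is a toy array-backed cursor. Against a real index, the cursor would instead come from IndexReader.termDocs(term).

```java
// Hypothetical stand-in for the subset of Lucene's TermDocs used here.
interface TermDocsLike {
    boolean next();  // advance to the next document containing the term
    int freq();      // occurrences of the term in the current document
}

final class CollectionFrequency {
    // Sums the within-document frequencies over every document containing
    // the term. Cost is linear in the number of such documents.
    static long sum(TermDocsLike termDocs) {
        long total = 0;
        while (termDocs.next()) {
            total += termDocs.freq();
        }
        return total;
    }

    // Toy cursor over an array of per-document frequencies, for demonstration.
    static TermDocsLike over(final int[] freqs) {
        return new TermDocsLike() {
            private int i = -1;
            public boolean next() { return ++i < freqs.length; }
            public int freq() { return freqs[i]; }
        };
    }
}
```

Storing the sum in the term dictionary would trade this per-query scan for one extra VInt per term.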
Marvin Humphrey wrote:
>
> On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:
>
>> Someone on the list a while ago suggested moving Term Vectors out of
>> the postings and storing them separately, as then they don't have to
>> be merged (but the doc ids would have to be kept up to date)
>
> Yes, that was me. :) I suggested storing TermVector data alongside
> stored field data, in the .fdt file. That's what KinoSearch does
> right now. It cuts down on disk seeks.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886
Re: Flexible Indexing (was Re: Lucene Planning)
Posted by Marvin Humphrey <ma...@rectangular.com>.
On Jun 1, 2006, at 5:48 AM, Grant Ingersoll wrote:
> Someone on the list a while ago suggested moving Term Vectors out
> of the postings and storing them separately, as then they don't
> have to be merged (but the doc ids would have to be kept up to date)
Yes, that was me. :) I suggested storing TermVector data alongside
stored field data, in the .fdt file. That's what KinoSearch does
right now. It cuts down on disk seeks.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Flexible Indexing (was Re: Lucene Planning)
Posted by Grant Ingersoll <gs...@syr.edu>.
Marvin Humphrey wrote:
> * Term Vectors (optional)
Someone on the list a while ago suggested moving Term Vectors out of the
postings and storing them separately, as then they don't have to be
merged (but the doc ids would have to be kept up to date)
--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University
School of Information Studies
335 Hinds Hall
Syracuse, NY 13244
http://www.cnlp.org
Voice: 315-443-5484
Fax: 315-443-6886