Posted to dev@lucene.apache.org by Christoph Goller <go...@detego-software.de> on 2004/10/01 13:21:15 UTC

Re: Term highlighting and Term vector patch

Grant Ingersoll wrote:
> Hi,
> 
> I was browsing the term highlighting code in the sandbox and I noticed
> the following comment for the getBestFragment method in the
> Highlighter.java code:
> 
> 	/**
> ...
> 	 * @param tokenStream   a stream of tokens identified in the text
> 	 * parameter, including offset information.
> 	 * This is typically produced by an analyzer re-parsing a document's
> 	 * text. Some work may be done on retrieving TokenStreams more
> 	 * efficiently by adding support for storing original text position
> 	 * data in the Lucene index but this support is not currently
> 	 * available (as of Lucene 1.4 rc2).
> ...
> 	 */
> 
> which struck me that I might be able to contribute some more time to
> make this so, since I recently submitted a patch to offer just such an
> enhancement to the term vector.
> 
> I would like to implement this, but I don't really want to submit a
> patch against another patch (It's hard enough managing all the changes
> that come down).  So, I was wondering if anyone (i.e. a committer) has
> had a chance to look at the Term Vector offset patch and what their
> thoughts are on it?  I can see the performance improvements in the
> highlighter that would come about by avoiding having to re-analyze the
> text, plus you could highlight the whole field if you wanted to.
> 
> Also, if I make this change, do the committers suggest I keep the
> current ability to analyze and have this as an alternative, or would it
> be safe to assume this is only used when offset info is stored?

Hi Grant,

as promised, I am currently looking through your patch, so please be patient
for a few more days. I stumbled over something in the current implementation
that took me some hours to understand and test. In the tvd-file you store field
numbers, using difference encoding (storing the differences between successive
field numbers, not their absolute values) and variable-length integers. The
problem is that FieldInfos does not necessarily store fields in alphabetical
order. No order is guaranteed at all, and the order can change from segment to
segment, as can the field numbers themselves. This means that the field numbers
you are writing into the tvd-file are not necessarily in increasing order, so
the difference encoding can produce negative entries. According to their
specification (e.g. IndexInput.readVInt()), variable-length integers only work
for positive numbers. All this was difficult to test, ...
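
To illustrate how the negative entries come about, here is a minimal
sketch of difference encoding applied to field numbers. The class name,
the method name and the sample field numbers are made up for this
example and are not taken from the patch:

```java
import java.util.Arrays;

final class FieldNumberDeltas {
    // Difference encoding: store each value as the gap to its predecessor.
    // The first gap is measured from 0.
    static int[] deltas(int[] fieldNumbers) {
        int[] out = new int[fieldNumbers.length];
        int previous = 0;
        for (int i = 0; i < fieldNumbers.length; i++) {
            out[i] = fieldNumbers[i] - previous;
            previous = fieldNumbers[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Increasing field numbers give only non-negative gaps.
        System.out.println(Arrays.toString(deltas(new int[] {1, 3, 4}))); // [1, 2, 1]
        // Field numbers in arbitrary order give a negative gap.
        System.out.println(Arrays.toString(deltas(new int[] {3, 1, 4}))); // [3, -2, 3]
    }
}
```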

The result is: it really is as described above, but luckily variable-length
integers also work for negative numbers, so term vectors work as they should.
However, I will change the field numbers from difference encoding to normal
encoding. I think one usually does not have more than 256 different fields, so
difference encoding is not necessary. Furthermore, a negative number always
takes 5 bytes as a variable-length integer, so difference encoding actually
needs more space than normal encoding here. Note that difference encoding for
positions of course remains unchanged, since it definitely is very effective
there.
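
The round trip and the space cost can be checked with a small
stand-alone sketch. It assumes the usual scheme of 7 data bits per byte
with the high bit as a continuation flag, and it is written from
scratch for illustration rather than copied from the Lucene sources.
The class and method names are hypothetical. With this scheme a
negative int comes out at five bytes, because all 32 bits have to be
written in 7-bit groups:

```java
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    // Write 7 bits per byte, low-order group first; the high bit marks "more bytes follow".
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7; // unsigned shift, so negative values terminate too
        }
        out.write(value);
        return out.toByteArray();
    }

    // Reassemble the 7-bit groups until a byte without the continuation bit is seen.
    static int readVInt(byte[] bytes) {
        int pos = 0;
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Total encoded size of a sequence of values.
    static int totalBytes(int[] values) {
        int total = 0;
        for (int v : values) {
            total += writeVInt(v).length;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(3).length);          // 1
        System.out.println(writeVInt(-2).length);         // 5
        System.out.println(readVInt(writeVInt(-2)));      // -2
        // Field numbers 3, 1, 4 stored as they are versus as differences 3, -2, 3.
        System.out.println(totalBytes(new int[] {3, 1, 4}));  // 3
        System.out.println(totalBytes(new int[] {3, -2, 3})); // 7
    }
}
```

So for a handful of small field numbers in arbitrary order, plain
encoding is the more compact choice in this sketch.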

Christoph









