Posted to java-user@lucene.apache.org by Renaud Delbru <re...@deri.org> on 2008/03/04 19:04:48 UTC

Incorrect Token Offset when using multiple fieldable instance

Hi,

I currently use multiple Fieldable instances to index the sentences of a 
document.
When there is only a single Fieldable instance, the token offsets 
generated in DocumentWriter are correct.
The problem appears when there are two or more Fieldable instances. In 
the DocumentWriter$FieldData#invertField method, if the field is 
tokenized, the offset attribute is not advanced by stringValue.length() 
(as it is when the field is not tokenized, line 1458); instead it is set 
from the end offset of the last token (line 1503: offset = offsetEnd+1;).
As a consequence, if a trailing token has been filtered out (for example 
a stopword, a dot, a space, etc.), the offset attribute is updated with 
the end offset of the last token that was not filtered. The stored offset 
is then too small (it is shifted back), and the offsets of all subsequent 
Fieldable instances are shifted back as well.
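
To make the drift concrete, here is a minimal plain-Java sketch of the bookkeeping described above. It is a model under stated assumptions, not Lucene code: the Tok class, the whitespace "analyzer" and both nextBase... helpers are hypothetical names introduced only for illustration, and the length-based rule is written as an increment so that it accumulates across values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of the offset bookkeeping described above; NOT Lucene code.
final class OffsetDriftDemo {

    // Hypothetical token type: start/end are relative to one field value.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Splits on spaces and drops stop words, standing in for an analyzer
    // with a stop filter.
    static List<Tok> analyze(String value, Set<String> stopWords) {
        List<Tok> kept = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && value.charAt(i) == ' ') i++;
            int start = i;
            while (i < value.length() && value.charAt(i) != ' ') i++;
            if (i > start) {
                String word = value.substring(start, i);
                if (!stopWords.contains(word)) kept.add(new Tok(word, start, i));
            }
        }
        return kept;
    }

    // Rule reported for tokenized fields: offset = offsetEnd + 1, where
    // offsetEnd is the absolute end offset of the last token that survived.
    static int nextBaseFromLastToken(int base, List<Tok> kept) {
        if (kept.isEmpty()) return base; // assumption: no token seen, base unchanged
        return base + kept.get(kept.size() - 1).end + 1;
    }

    // Rule reported for untokenized fields: advance by the value's length.
    static int nextBaseFromLength(int base, String value) {
        return base + value.length();
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("in", "the"));
        String first = "fox in the"; // ends with two stop words
        List<Tok> kept = analyze(first, stop);
        System.out.println("base from last token: " + nextBaseFromLastToken(0, kept));
        System.out.println("base from length:     " + nextBaseFromLength(0, first));
    }
}
```

With this input the last-token rule yields 4 although the value is 10 characters long, so the tokens of the next value would be recorded at least 6 characters too early.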

Is it a bug? Or is it the desired behavior (and in that case, why)?

Regards.

-- 
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Michael McCandless <lu...@mikemccandless.com>.

Toph wrote:

> Michael McCandless-2 wrote:
>>
>>
>> We could alternatively extend TokenStream so you could query it for
>> the final offset, then fix indexing to use that value instead of the
>> endOffset of the last token that it saw.
>>
>>
>
> Querying the tokenstream for the final offset would be good, but then
> would the change be put into the DocumentWriter directly or available
> as an option?

I would put the change into DocumentsWriter directly (i.e. enabled by
default), with an option to restore the old (buggy) behavior for those
apps that have workarounds and need to stay backwards compatible.

Mike

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Toph <ch...@gmail.com>.

Michael McCandless-2 wrote:
> 
> 
> This would actually be a fairly large change: it's a change to the  
> index format and all APIs that handle offsets during indexing &  
> searching/retrieving.
> 
> 

For now I just changed the offset calculation in DocumentWriter as specified
here by the OP:

> replace DocumentWriter$FieldData#invertField offset = offsetEnd+1; by
> offset = stringValue.length(); 
> 

It has side effects as previously mentioned on this list, e.g. if the
tokenstream is not backed by a stringValue or the Analyzer does not
calculate offsets in the normal way.  But for my purposes it works.

This issue was also discussed previously here:
http://lucene.markmail.org/search/?q=offset%20documentwriter#query:offset%20documentwriter+page:1+mid:l6jbfmfisyg5zyre+state:results

Michael McCandless-2 wrote:
> 
> 
> We could alternatively extend TokenStream so you could query it for  
> the final offset, then fix indexing to use that value instead of the  
> endOffset of the last token that it saw.
> 
> 

Querying the tokenstream for the final offset would be good, but then would the
change be put into the DocumentWriter directly or available as an option?

Chris
-- 
View this message in context: http://www.nabble.com/Incorrect-Token-Offset-when-using-multiple-fieldable-instance-tp15833468p18238566.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Michael McCandless <lu...@mikemccandless.com>.

This would actually be a fairly large change: it's a change to the  
index format and all APIs that handle offsets during indexing &  
searching/retrieving.

We could alternatively extend TokenStream so you could query it for  
the final offset, then fix indexing to use that value instead of the  
endOffset of the last token that it saw.
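
As a rough illustration of that second idea, the sketch below models a stream that can report its final offset even when trailing tokens were filtered out. Everything here is an assumption made for illustration: OffsetAwareStream, finalOffset(), StopFilteredWords and indexStarts are hypothetical names in plain Java, not part of the Lucene API discussed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of "query the stream for its final offset"; NOT Lucene code.
final class FinalOffsetSketch {

    // Hypothetical token type: start/end are relative to one field value.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical interface: a token stream that also declares where it ended.
    interface OffsetAwareStream {
        Tok next();        // returns null when the stream is exhausted
        int finalOffset(); // offset just past all consumed text, filtered or not
    }

    // Splits on spaces, drops stop words, and reports the full value length
    // as its final offset.
    static final class StopFilteredWords implements OffsetAwareStream {
        private final String value;
        private final Set<String> stop;
        private int pos = 0;

        StopFilteredWords(String value, Set<String> stop) {
            this.value = value;
            this.stop = stop;
        }

        public Tok next() {
            while (pos < value.length()) {
                while (pos < value.length() && value.charAt(pos) == ' ') pos++;
                int start = pos;
                while (pos < value.length() && value.charAt(pos) != ' ') pos++;
                if (pos > start) {
                    String word = value.substring(start, pos);
                    if (!stop.contains(word)) return new Tok(word, start, pos);
                }
            }
            return null;
        }

        public int finalOffset() {
            return value.length();
        }
    }

    // Returns the absolute start offset of every surviving token across all
    // values, advancing the base by the declared final offset (+1 gap,
    // mirroring the "+1" in offset = offsetEnd+1).
    static List<Integer> indexStarts(List<String> values, Set<String> stop) {
        List<Integer> starts = new ArrayList<>();
        int base = 0;
        for (String value : values) {
            OffsetAwareStream stream = new StopFilteredWords(value, stop);
            for (Tok t = stream.next(); t != null; t = stream.next()) {
                starts.add(base + t.start);
            }
            base += stream.finalOffset() + 1;
        }
        return starts;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("in", "the"));
        System.out.println(indexStarts(Arrays.asList("fox in the", "lazy dog"), stop));
    }
}
```

Because the base advances by the declared final offset rather than by the last surviving token, the filtered "in the" no longer shifts the second value's offsets back.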

Mike

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Toph <ch...@gmail.com>.

Interesting discussion... glad I'm not the only one with this challenge.

Michael McCandless-2 wrote:
> 
> EG, if you use Highlighter on a  
> multi-valued field indexed with stored field & term vectors and say  
> the first field ended with a stop word that was filtered out, then  
> your offsets will be off and the wrong parts will be highlighted 
> 

I found this post while attempting exactly this, and I can confirm that,
yes, the offsets are incorrect for all but the first instance of the
field in the document, so they are useless for highlighting.  I tried
concatenating all instances of the field, but of course if an instance of
the field ended with punctuation or a stop word, those characters were not
counted in the offset.  I'll try the suggested workaround of adding a dummy
term at the end of each field, but a better API would be for "offset" to
become a pair of ints, the first being the index of the Field in
getFields(name) and the second being the offset within that instance of
the field.
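
A minimal sketch of what such a per-instance offset could look like, under stated assumptions: FieldOffset and resolve are hypothetical names, the stored values are modeled as a plain String[] rather than real Field objects, and an end offset is carried alongside the start so that a span can be cut out directly.

```java
// Plain-Java sketch of a (field instance, offset) pair; NOT Lucene code.
final class FieldOffsetSketch {

    // Hypothetical offset type: which instance of the field, plus a span
    // measured inside that instance only.
    static final class FieldOffset {
        final int valueIndex; // index into the values of a multi-valued field
        final int start;      // start offset within that value
        final int end;        // end offset (exclusive) within that value

        FieldOffset(int valueIndex, int start, int end) {
            this.valueIndex = valueIndex;
            this.start = start;
            this.end = end;
        }
    }

    // Cuts the referenced span out of the right value. No running base offset
    // is needed, so text filtered from earlier values cannot shift it.
    static String resolve(String[] values, FieldOffset offset) {
        return values[offset.valueIndex].substring(offset.start, offset.end);
    }

    public static void main(String[] args) {
        String[] values = {"fox in the", "lazy dog"};
        System.out.println(resolve(values, new FieldOffset(1, 5, 8)));
    }
}
```

The design point is that each offset is local to one value, so a highlighter would not depend on how earlier values were tokenized or filtered.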

Christopher
-- 
View this message in context: http://www.nabble.com/Incorrect-Token-Offset-when-using-multiple-fieldable-instance-tp15833468p18206216.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Michael McCandless <lu...@mikemccandless.com>.

Well, first off, sometimes the thing being indexed isn't a string, so  
you have no stringValue to get its length.  It could be a Reader or a  
TokenStream.

Second, it's conceivable that an analyzer computes its own
"interesting" offsets that are not in fact simple indices into the
stringValue, though I would expect that to be the exception, not the
rule.

I can't think of any other harm ... so if neither of these applies in
your situation then it should be OK?

I do agree this seems like a bug.  EG, if you use Highlighter on a  
multi-valued field indexed with stored field & term vectors and say  
the first field ended with a stop word that was filtered out, then  
your offsets will be off and the wrong parts will be highlighted in  
all but the first field (I think?).  I think we really need some way  
for the tokenStream to "declare" its final offset at the end.

Mike

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Renaud Delbru <re...@deri.org>.

Do you know if there will be side effects if, in 
DocumentWriter$FieldData#invertField, we replace
offset = offsetEnd+1;
by
offset = stringValue.length();

I still do not understand the reason for choosing this way of 
incrementing the start offset.

Regards.

-- 
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

Re: Incorrect Token Offset when using multiple fieldable instance

Posted by Michael McCandless <lu...@mikemccandless.com>.

This is how Lucene has worked for quite some time (since 1.9).

When there are multiple fields with the same name in one Document,  
each field's offset starts from the last offset (offset of the last  
token) seen in the previous field.  If tokens are skipped at the end  
there's no way IndexWriter can know (because tokenStream doesn't  
return them).  It's as if we need the ability to query a tokenStream  
for its "final" offset or something.

One workaround might be to insert an "end marker" token, with the  
true end offset, which is a term you would never search on?
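
A sketch of that workaround in plain Java, under stated assumptions: the Tok class, the END_MARKER term, withEndMarker and nextBase are all hypothetical names, and nextBase models the offsetEnd+1 rule quoted in this thread rather than calling any Lucene code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of the "end marker" workaround; NOT Lucene code.
final class EndMarkerSketch {

    // Hypothetical marker term, chosen so that nobody would search on it.
    static final String END_MARKER = "__end_of_value__";

    // Hypothetical token type: start/end are relative to one field value.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Appends a zero-width marker token positioned at the true end of the
    // value, so the last token seen always carries the real end offset.
    static List<Tok> withEndMarker(List<Tok> kept, String value) {
        List<Tok> out = new ArrayList<>(kept);
        out.add(new Tok(END_MARKER, value.length(), value.length()));
        return out;
    }

    // Models the rule from the thread: next base = end of last token seen + 1.
    static int nextBase(int base, List<Tok> tokens) {
        if (tokens.isEmpty()) return base; // assumption: no token seen, base unchanged
        return base + tokens.get(tokens.size() - 1).end + 1;
    }

    public static void main(String[] args) {
        String value = "fox in the"; // "in" and "the" assumed filtered as stop words
        List<Tok> kept = Collections.singletonList(new Tok("fox", 0, 3));
        System.out.println("without marker: " + nextBase(0, kept));
        System.out.println("with marker:    " + nextBase(0, withEndMarker(kept, value)));
    }
}
```

The cost of this approach is one extra term per field value in the index, which is why the marker should be a term that is never queried.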

Mike
