Posted to dev@lucene.apache.org by "Ram Venkat (JIRA)" <ji...@apache.org> on 2019/04/24 07:25:00 UTC

[jira] [Comment Edited] (LUCENE-8776) Start offset going backwards has a legitimate purpose

    [ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824866#comment-16824866 ] 

Ram Venkat edited comment on LUCENE-8776 at 4/24/19 7:24 AM:
-------------------------------------------------------------

Robert - Thanks for the quick, detailed response. 

If negative deltas of offsets break postings, shouldn't the check be included only if postings are used?

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, right?

We use the default highlighter with term vectors. Functionally, offsets going backwards work well, as I explained in my previous post. We do extensive performance tests and we do not have an issue there either. The unified highlighter is not an option for us at this point, as it does not support SurroundParser yet.

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?
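
To make the question about negative deltas concrete, here is a toy sketch. It assumes, for illustration only, that postings store each start offset as the difference from the previous one, written as an unsigned variable-length integer. The function names `encode_vint` and `encode_start_offsets` are hypothetical and this is not Lucene's codec; it only shows why such an encoding has no way to represent a start offset that moves backwards.

```python
def encode_vint(value: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte, low-order bits first."""
    if value < 0:
        # An unsigned variable-length int has no representation for this.
        raise ValueError(f"cannot encode negative value {value}")
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_start_offsets(start_offsets) -> bytes:
    """Delta-encode start offsets (assumed layout, for illustration only)."""
    out = bytearray()
    previous = 0
    for start in start_offsets:
        out += encode_vint(start - previous)  # fails if start < previous
        previous = start
    return bytes(out)
```

With non-decreasing start offsets the deltas are small and encode compactly. As soon as one start offset is lower than its predecessor, the delta is negative and this encoder refuses it. Term vectors, which we use for highlighting, are a separate structure, which is why I am asking whether the check could live closer to where postings offsets are written.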

 


was (Author: venkat11):
Robert - Thanks for the quick, detailed response. 

If negative deltas of offsets break postings, shouldn't the check be included only if postings are used?

If performance gets worse for large documents, isn't it better to just log a warning, rather than completely remove that feature? Net performance depends on other factors like hardware, even for O(n^2), right?

We use the default highlighter with term vectors. Functionally, offsets going backwards work well, as I explained in my previous post. We do extensive performance tests and we do not have an issue there either. The unified highlighter is not an option for us at this point, as it does not support SurroundParser yet.

At this point, we are forced to remove this check and recompile the source. Instead, can we move this check to where postings are used?

 

> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly. 
> At index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, because queries for 'light' adjacent to 'emitting', or for 'light' at a distance of two words from 'diode', need to match this word. So, the order of words after splitting is: Organic, light, emitting, diode, glows.
> But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) in the same position as 'light' and (b) in the same position as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?  
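
The token layout described in the issue can be modelled in a few lines. This is a minimal sketch in plain Python, not Lucene code: the `Token` tuple, the order of tokens that share a position, and the helper names are assumptions made for illustration. The character offsets are computed from the example sentence, and `first_backwards_offset` only mimics the kind of check quoted above.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    term: str
    position_increment: int  # 0 means "same position as the previous token"
    start_offset: int
    end_offset: int


TEXT = "Organic light-emitting-diode glows"

# Emission order is an assumption; the compound token is stacked on
# 'light' (position 1) and on 'diode' (position 3), as in the table.
TOKENS = [
    Token("organic", 1, 0, 7),
    Token("light", 1, 8, 13),
    Token("light-emitting-diode", 0, 8, 28),
    Token("emitting", 1, 14, 22),
    Token("diode", 1, 23, 28),
    Token("light-emitting-diode", 0, 8, 28),
    Token("glows", 1, 29, 34),
]


def positions(tokens: List[Token]) -> List[int]:
    """Accumulate position increments into absolute positions."""
    result, position = [], -1
    for token in tokens:
        position += token.position_increment
        result.append(position)
    return result


def first_backwards_offset(tokens: List[Token]) -> Optional[int]:
    """Return the index of the first token whose start offset is lower
    than the previous token's start offset, or None if there is none."""
    last_start = 0
    for index, token in enumerate(tokens):
        if token.start_offset < last_start:
            return index
        last_start = token.start_offset
    return None
```

In this model the second 'light-emitting-diode' follows 'diode' (start offset 23) but starts at offset 8, so a check that requires non-decreasing start offsets rejects the stream at that token, even though every offset points at the correct span of the original text.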


