Posted to issues@lucene.apache.org by "Roman (Jira)" <ji...@apache.org> on 2020/08/10 22:25:00 UTC

[jira] [Commented] (LUCENE-8776) Start offset going backwards has a legitimate purpose

    [ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175081#comment-17175081 ] 

Roman commented on LUCENE-8776:
-------------------------------

I suffer from the same issue: we have multi-token synonyms that can even overlap. I recognize the arguments against backward offsets, but I find them surprisingly backwards: they say that the implementation dictates the function, when for many people the function is the goal. The arguments also seem to say that because the most efficient implementation (non-negative integer deltas) does not allow backward offsets, backward offsets are a bug.

Please recognize that the most elegant implementation sometimes means "as complex as needed"; that is not the same as "the simplest". If negative vints consume 5 bytes instead of 4, some people need to pay that price and are willing to. Their use cases cannot simply be 'boxed' into a world where one only looks ahead and never back (NLP is one such use case).
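
To make the byte cost concrete, here is a minimal standalone sketch of the 7-bits-per-byte variable-length encoding that vints use. It is a model written for illustration, not Lucene's own code; the class and method names (VIntLengthSketch, vIntLength) are made up for this example. It only counts how many bytes a given int would occupy under that scheme.

```java
// Standalone model of variable-length int sizing (7 payload bits per byte).
// Hypothetical helper for illustration; it does not call into Lucene.
final class VIntLengthSketch {

    /** Number of bytes needed to encode value with 7 payload bits per byte. */
    static int vIntLength(int value) {
        int bytes = 1;
        // While anything remains above the low 7 bits, another byte is needed.
        while ((value & ~0x7F) != 0) {
            value >>>= 7; // unsigned shift, so negative values terminate too
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, (1 << 28) - 1, 1 << 28, -1};
        for (int v : samples) {
            System.out.println(v + " -> " + vIntLength(v) + " byte(s)");
        }
    }
}
```

Under this model, non-negative values below 2^28 fit in at most 4 bytes, while any negative value has its top bit set and therefore always needs the full 5 bytes. That is the extra cost being discussed.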

Lucene does, however, invite one particular solution:

The vint implementation does not seem to mind a negative offset (https://issues.apache.org/jira/browse/LUCENE-3738), and DefaultIndexingChain extends DocConsumer. The name 'Default' suggests that at some point in the past Lucene developers wanted to provide other implementations. As it is *right now*, it is not easy to plug in a different 'DocConsumer', which surely seems like an important omission (one size fits all?).

So if we just add a simple mechanism to instruct Lucene which DocConsumer to use, then everyone could be happy and nobody would have to resort to dirty hacks or forks. The most efficient implementation will remain the default, yet it will allow us - dirty bastards - to shoot ourselves in the foot if we so desire. Solr and Elasticsearch devs might not mind having the option in the future; it can come in handy. Wouldn't that be wonderful? Well, wonderful certainly not, just useful... Could I do it? [~rcmuir] [~mikemccand] [~simonw]


> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, as 'light' adjacent to 'emitting' and 'light' at a distance of two words from 'diode' need to match this word. So, the order of words after splitting is: Organic, light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) in the same position as 'light' and (b) in the same position as 'diode', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?  
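
The token layout in the quoted description can be written out to show exactly where the start offset goes backwards. The sketch below is a simplified standalone model, assuming the check only compares each token's start offset with the previous token's; it is not the code from DefaultIndexingChain. The Token class, the OffsetOrderSketch name, and the order of tokens that share a position are assumptions made for illustration. Offsets are character offsets into "Organic light-emitting-diode glows".

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of an "offsets must not go backwards" check.
// All names here are hypothetical; nothing in this block is a Lucene API.
final class OffsetOrderSketch {

    static final class Token {
        final String term;
        final int positionIncrement; // 0 means "same position as previous token"
        final int startOffset;
        final int endOffset;

        Token(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Tokens for "Organic light-emitting-diode glows" as laid out in the issue. */
    static List<Token> exampleTokens() {
        return Arrays.asList(
            new Token("organic", 1, 0, 7),                // position 0
            new Token("light", 1, 8, 13),                 // position 1
            new Token("light-emitting-diode", 0, 8, 28),  // position 1
            new Token("emitting", 1, 14, 22),             // position 2
            new Token("diode", 1, 23, 28),                // position 3
            new Token("light-emitting-diode", 0, 8, 28),  // position 3, start offset 8 < 23
            new Token("glows", 1, 29, 34));               // position 4
    }

    /** Index of the first token whose start offset is below the previous one, or -1. */
    static int firstBackwardOffset(List<Token> tokens) {
        int lastStartOffset = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.startOffset < lastStartOffset) {
                return i;
            }
            lastStartOffset = t.startOffset;
        }
        return -1;
    }

    public static void main(String[] args) {
        int i = firstBackwardOffset(exampleTokens());
        System.out.println("first backward start offset at token index " + i);
    }
}
```

In this model the second 'light-emitting-diode' token is the one that trips the check: it starts at offset 8 after 'diode' has already started at offset 23. The first copy, at the position of 'light', passes because its start offset equals the previous one.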



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
