Posted to issues@lucene.apache.org by "Roman (Jira)" <ji...@apache.org> on 2020/08/14 16:15:00 UTC
[jira] [Comment Edited] (LUCENE-8776) Start offset going backwards has a legitimate purpose

    [ https://issues.apache.org/jira/browse/LUCENE-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177852#comment-17177852 ] 

Roman edited comment on LUCENE-8776 at 8/14/20, 4:14 PM:
---------------------------------------------------------

{{Sorry for cross-posting (to the forum and here). I will study [~dweiss]'s example, but here is a write-up that may be useful. Please jump to the last example, where the PositionLength attribute would fail us.}}
{code:java}
assertU(adoc("id", "603", "bibcode", "xxxxxxxxxx603",
         "title", "THE HUBBLE constant: a summary of the hubble space telescope program"));{code}
 
{code:java}
term=hubble posInc=2 posLen=1 type=word offsetStart=4 offsetEnd=10
 term=acr::hubble posInc=0 posLen=1 type=ACRONYM offsetStart=4 offsetEnd=10
 term=constant posInc=1 posLen=1 type=word offsetStart=11 offsetEnd=20
 term=summary posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=30
 term=hubble posInc=1 posLen=1 type=word offsetStart=38 offsetEnd=44
 term=syn::hubble space telescope posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=syn::hst posInc=0 posLen=3 type=SYNONYM offsetStart=38 offsetEnd=60
 term=acr::hst posInc=0 posLen=3 type=ACRONYM offsetStart=38 offsetEnd=60
 * term=space posInc=1 posLen=1 type=word offsetStart=45 offsetEnd=50
 term=telescope posInc=1 posLen=1 type=word offsetStart=51 offsetEnd=60
 term=program posInc=1 posLen=1 type=word offsetStart=61 offsetEnd=68{code}
{{* - fails because offsetEnd < lastToken.offsetEnd. If the stream were reordered (the multi-token synonym emitted as the last token), it would fail as well, because of the check for lastToken.beginOffset < currentToken.beginOffset. Basically, any reordering would result in a failure (unless offsets are trimmed).}}
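
To make the failing condition concrete, here is a minimal plain-Java sketch that replays the offsets from the dump above and applies only the end-offset condition as described in this comment (offsetEnd < lastToken.offsetEnd). It does not use Lucene; `EndOffsetCheck`, `Tok`, `hubbleTitle` and `endOffsetViolations` are hypothetical names introduced for illustration, and the condition is a paraphrase of the comment, not a copy of Lucene's code.

```java
import java.util.ArrayList;
import java.util.List;

class EndOffsetCheck {
    /** Minimal stand-in for a token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Token stream copied from the dump for document 603, in emission order. */
    static List<Tok> hubbleTitle() {
        List<Tok> t = new ArrayList<>();
        t.add(new Tok("hubble", 4, 10));
        t.add(new Tok("acr::hubble", 4, 10));
        t.add(new Tok("constant", 11, 20));
        t.add(new Tok("summary", 23, 30));
        t.add(new Tok("hubble", 38, 44));
        t.add(new Tok("syn::hubble space telescope", 38, 60));
        t.add(new Tok("syn::hst", 38, 60));
        t.add(new Tok("acr::hst", 38, 60));
        t.add(new Tok("space", 45, 50));
        t.add(new Tok("telescope", 51, 60));
        t.add(new Tok("program", 61, 68));
        return t;
    }

    /** Indexes of tokens whose end offset is smaller than the previous token's end offset. */
    static List<Integer> endOffsetViolations(List<Tok> tokens) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).end < tokens.get(i - 1).end) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<Tok> tokens = hubbleTitle();
        for (int i : endOffsetViolations(tokens)) {
            Tok t = tokens.get(i);
            System.out.println("violation at token " + i + ": " + t.term
                    + " start=" + t.start + " end=" + t.end);
        }
    }
}
```

With these offsets, the only token flagged is `space` (end 50 after the synonyms ending at 60), which is the line marked with `*`.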

{{The following example has an additional twist because of `space-time`: the tokenizer first splits the word and generates two new tokens, and those alternative tokens are then used to find synonyms (space == universe).}}
{code:java}
assertU(adoc("id", "605", "bibcode", "xxxxxxxxxx604",
         "title", "MIT and anti de sitter space-time"));{code}
{code:java}
term=xxxxxxxxxx604 posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=13
 term=mit posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=syn::massachusetts institute of technology posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=syn::mit posInc=0 posLen=1 type=SYNONYM offsetStart=0 offsetEnd=3
 term=acr::mit posInc=0 posLen=1 type=ACRONYM offsetStart=0 offsetEnd=3
 term=anti posInc=1 posLen=1 type=word offsetStart=8 offsetEnd=12
 term=syn::ads posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=acr::ads posInc=0 posLen=4 type=ACRONYM offsetStart=8 offsetEnd=28
 term=syn::anti de sitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter spacetime posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 term=syn::antidesitter space posInc=0 posLen=4 type=SYNONYM offsetStart=8 offsetEnd=28
 * term=de posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=15
 term=sitter posInc=1 posLen=1 type=word offsetStart=16 offsetEnd=22
 term=space posInc=1 posLen=1 type=word offsetStart=23 offsetEnd=28
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=23 offsetEnd=28
 term=time posInc=1 posLen=1 type=word offsetStart=29 offsetEnd=33
 term=spacetime posInc=0 posLen=1 type=word offsetStart=23 offsetEnd=33{code}
{{So far, all of these cases could be handled with the new position length attribute. But let us look at a case where that would fail too.}}
{code:java}
assertU(adoc("id", "606", "bibcode", "xxxxxxxxxx604",
         "title", "Massachusetts Institute of Technology and antidesitter space-time"));{code}
{code:java}
term=massachusetts posInc=1 posLen=1 type=word offsetStart=0 offsetEnd=12
 term=syn::massachusetts institute of technology posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=syn::mit posInc=0 posLen=4 type=SYNONYM offsetStart=0 offsetEnd=36
 term=acr::mit posInc=0 posLen=4 type=ACRONYM offsetStart=0 offsetEnd=36
 term=institute posInc=1 posLen=1 type=word offsetStart=13 offsetEnd=22
 term=technology posInc=1 posLen=1 type=word offsetStart=26 offsetEnd=36
 term=antidesitter posInc=1 posLen=1 type=word offsetStart=41 offsetEnd=53
 term=syn::ads posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=acr::ads posInc=0 posLen=2 type=ACRONYM offsetStart=41 offsetEnd=59
 term=syn::anti de sitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter spacetime posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=syn::antidesitter space posInc=0 posLen=2 type=SYNONYM offsetStart=41 offsetEnd=59
 term=space posInc=1 posLen=1 type=word offsetStart=54 offsetEnd=59
 term=syn::universe posInc=0 posLen=1 type=SYNONYM offsetStart=54 offsetEnd=59
 term=time posInc=1 posLen=1 type=word offsetStart=60 offsetEnd=64
 term=spacetime posInc=0 posLen=1 type=word offsetStart=54 offsetEnd=64{code}
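{{Notice the posLen=4 of the MIT synonyms: it would cover the tokens `massachusetts institute technology antidesitter`, while the offsets are still correct. The stop word `of` leaves no position gap (institute and technology both have posInc=1), so a position length of 4 reaches one token too far.}}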
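
The position arithmetic behind this can be sketched in plain Java, without Lucene. The sketch sums posInc values to get absolute positions and then lists the `word` tokens that fall inside the span of a given multi-position token. `PosLenCoverage`, `Tok`, `mitTitle` and `coveredWords` are hypothetical names, the token list is a reduced copy of the dump above (one synonym per group), and starting the position counter at -1 is an assumption that does not affect which tokens are covered.

```java
import java.util.ArrayList;
import java.util.List;

class PosLenCoverage {
    /** Minimal stand-in for a token: term, position increment, position length, type. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;
        final String type;

        Tok(String term, int posInc, int posLen, String type) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
            this.type = type;
        }
    }

    /** Reduced copy of the dump for document 606 (one synonym kept per group). */
    static List<Tok> mitTitle() {
        List<Tok> t = new ArrayList<>();
        t.add(new Tok("massachusetts", 1, 1, "word"));
        t.add(new Tok("syn::mit", 0, 4, "SYNONYM"));
        t.add(new Tok("institute", 1, 1, "word"));
        t.add(new Tok("technology", 1, 1, "word"));
        t.add(new Tok("antidesitter", 1, 1, "word"));
        t.add(new Tok("syn::ads", 0, 2, "SYNONYM"));
        t.add(new Tok("space", 1, 1, "word"));
        t.add(new Tok("time", 1, 1, "word"));
        return t;
    }

    /** Absolute positions: running sum of posInc, starting from -1 (assumed convention). */
    static int[] positions(List<Tok> tokens) {
        int[] pos = new int[tokens.size()];
        int p = -1;
        for (int i = 0; i < tokens.size(); i++) {
            p += tokens.get(i).posInc;
            pos[i] = p;
        }
        return pos;
    }

    /** Word tokens whose position lies in [pos, pos + posLen) of the token at index. */
    static List<String> coveredWords(List<Tok> tokens, int index) {
        int[] pos = positions(tokens);
        int from = pos[index];
        int to = from + tokens.get(index).posLen; // exclusive upper bound
        List<String> covered = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.type.equals("word") && pos[i] >= from && pos[i] < to) {
                covered.add(t.term);
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        List<Tok> tokens = mitTitle();
        System.out.println("syn::mit covers " + coveredWords(tokens, 1));
        System.out.println("syn::ads covers " + coveredWords(tokens, 5));
    }
}
```

For `syn::mit` the covered words come out as massachusetts, institute, technology and antidesitter, so the position span swallows a token that the offsets (0 to 36) do not.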


> Start offset going backwards has a legitimate purpose
> -----------------------------------------------------
>
>                 Key: LUCENE-8776
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8776
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.6
>            Reporter: Ram Venkat
>            Priority: Major
>         Attachments: LUCENE-8776-proof-of-concept.patch
>
>
> Here is the use case where startOffset can go backwards:
> Say there is a line "Organic light-emitting-diode glows", and I want to run span queries and highlight them properly. 
> During index time, light-emitting-diode is split into three words, which allows me to search for 'light', 'emitting' and 'diode' individually. The three words occupy adjacent positions in the index, because 'light' adjacent to 'emitting' and 'light' at a distance of two words from 'diode' both need to match this word. So, the order of words after splitting is: Organic, light, emitting, diode, glows. 
> But, I also want to search for 'organic' being adjacent to 'light-emitting-diode' or 'light-emitting-diode' being adjacent to 'glows'. 
> The way I solved this was to also generate 'light-emitting-diode' at two positions: (a) In the same position as 'light' and (b) in the same position as 'glows', like below:
> ||organic||light||emitting||diode||glows||
> | |light-emitting-diode| |light-emitting-diode| |
> |0|1|2|3|4|
> The positions of the two 'light-emitting-diode' are 1 and 3, but the offsets are obviously the same. This works beautifully in Lucene 5.x in both searching and highlighting with span queries. 
> But when I try this in Lucene 7.6, it hits the condition "Offsets must not go backwards" at DefaultIndexingChain:818. This IllegalArgumentException is being thrown without any comments on why this check is needed. As I explained above, startOffset going backwards is perfectly valid, to deal with word splitting and span operations on these specialized use cases. On the other hand, it is not clear what value is added by this check and which highlighter code is affected by offsets going backwards. This same check is done at BaseTokenStreamTestCase:245. 
> I see others talk about how this check found bugs in WordDelimiter etc. but it also prevents legitimate use cases. Can this check be removed?  
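
As an illustration of the layout in the issue description, the following plain-Java sketch builds the token stream for "Organic light-emitting-diode glows" with 'light-emitting-diode' at positions 1 and 3, and looks for the first token whose start offset is smaller than the previous one. It is a paraphrase of the "Offsets must not go backwards" condition as described in the issue, not Lucene's code. `StartOffsetCheck`, `Tok`, `ledLine` and `firstBackwardStartOffset` are hypothetical names, the offsets are derived from the text with `indexOf`, and the emission order within a position (original word first, then the compound) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class StartOffsetCheck {
    /** Minimal stand-in for a token: term, absolute position, character offsets. */
    static final class Tok {
        final String term;
        final int position;
        final int start;
        final int end;

        Tok(String term, int position, int start, int end) {
            this.term = term;
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    /** Builds a token whose offsets are where the term first occurs in the text. */
    static Tok tok(String text, String term, int position) {
        int start = text.toLowerCase(Locale.ROOT).indexOf(term);
        return new Tok(term, position, start, start + term.length());
    }

    /** Tokens for the example line, with the compound emitted at positions 1 and 3. */
    static List<Tok> ledLine() {
        String text = "Organic light-emitting-diode glows";
        List<Tok> t = new ArrayList<>();
        t.add(tok(text, "organic", 0));
        t.add(tok(text, "light", 1));
        t.add(tok(text, "light-emitting-diode", 1));
        t.add(tok(text, "emitting", 2));
        t.add(tok(text, "diode", 3));
        t.add(tok(text, "light-emitting-diode", 3));
        t.add(tok(text, "glows", 4));
        return t;
    }

    /** Index of the first token whose start offset goes backwards, or -1 if none does. */
    static int firstBackwardStartOffset(List<Tok> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).start < tokens.get(i - 1).start) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Tok> tokens = ledLine();
        int i = firstBackwardStartOffset(tokens);
        if (i >= 0) {
            Tok t = tokens.get(i);
            System.out.println("start offset goes backwards at position " + t.position
                    + ": " + t.term + " start=" + t.start);
        }
    }
}
```

The second copy of 'light-emitting-diode' (position 3) starts at the same offset as 'light', which is earlier than the start of 'diode' emitted just before it, so this is the token that trips the condition.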


