You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Martin Hermann (JIRA)" <ji...@apache.org> on 2018/08/28 13:57:00 UTC

[jira] [Commented] (LUCENE-8196) Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields

    [ https://issues.apache.org/jira/browse/LUCENE-8196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595002#comment-16595002 ] 

Martin Hermann commented on LUCENE-8196:
----------------------------------------

First of all, I really like this implementation and the ideas that went into it. But as I have spent quite some time with the old span queries and their problems, I'd like to comment on some things and maybe offer some fresh view points for old problems:
 
 
Obviously, maxwidth is not completely identical to specifying slop: Let's say we want to do some sort of synonym expansion and query for "("big bad" OR evil) wolf" (this is of course related to the prefix-problem we already know about ("genome editing"), but I think still slightly different).
With span queries, this would have been possible, as we just have to set slop to 0 in all queries, but now we have to do something like
{code:java}
Intervals.maxwidth(3,
        Intervals.ordered(
            Intervals.or(
                Intervals.maxwidth(2,
                    Intervals.ordered(Intervals.term("big"),Intervals.term("bad"))),
                Intervals.term("evil")),
            Intervals.term("wolf")));{code}
which also matches "evil eyes wolf", which should not be a match. It would be possible to rewrite the query so that the disjunction is at the top level, something like
{code:java}
Intervals.or(
    Intervals.maxwidth(2,
        Intervals.ordered(Intervals.term("evil"),Intervals.term("wolf"))),
    Intervals.maxwidth(3,
        Intervals.ordreed(Intervals.term("big"),Intervals.term("bad"),Intervals.term("wolf"))));{code}

which would work as expected, but I think we can agree that this is not really a nice solution (but I will come back to it later).

 

Now, we already know that "(big OR "big bad") wolf" would not match "big bad wolf" (this is exactly the genome editing thing), but I think it is worth to point out exactly why: It actually should not match, according to the definition of "minimum interval": Any match for "big bad" is also a match for big, so the first IntervalsSource only passes matches for "big", and then we get no match for "big wolf". This is a feature of the query semantic of the paper (and maybe the reason for the efficency and simplicity of the algorithms): The problems that spanQueries had are gone, because we define the unexpected behaviour to be correct*. As much as I like the IntervalQueries, I do not really think this is satisfactory.

 
There are actually other, similar cases with containing/containedBy: Let's say our document is "big bad big wolf" and we want "bad wolf" (slop 1) to be contained by "big wolf" (slop 2). We would get no match in this document, as the minimal match for the big interval is just "big wolf" (as the other match, "big bad big wolf" contains this one). At least to me this is counter intuitive and I would expect the document to match.
It really gets strange if we mix in some "OR":
{noformat}
"big wolf" (slop 1) contained in ("big wolf" (slop 1) OR "bad wolf") {noformat}
does not match "big bad wolf", in contrast to
{noformat}
"big wolf" (slop 1) contained in ("big wolf" (slop 1)){noformat}
, which does. So we actually lose a match by adding a OR-clause, and I think we can agree that this is not really good. Of course these are not queries a human would write, but I think one major use case of span queries is some sort of automatic query generation, and that's where I think it is really important to meet at least some basic expectations (such as not losing matches by adding disjunctions).
 
I don't see a way to fix this that still follows minimal interval semantics, as all this is actually how it SHOULD work there, but this would mean we'd lose the correctness proofs. The only thing I can think of is some sort of query rewriting, pushing the disjunction as far top as neccessary, but this may be rather performance heavy and also does not solve the "bad wolf" (slop 1) contained by "big wolf" (slop 2) problem.
 
Any thoughts?

*A short theoretical aside: I think that most of the span query problems came from the fact that we want to have a "next match" function, i.e. some sort of ordering of matches, together with the nature of span query Matches, which are essentially a pair of numbers (start and end of match). This means we have to specify an order on pairs of numbers (which is possible, of course; the solution with span queries was a lexical order, i.e. the start always increases, and if it stays the same, the end increases). But I think it is not really possible to implement completly lazy behaviour with this ordering: Think of some ordered "((a OR b) followed by (c OR d)) with enough slop" and the document "a b c d" which should find "a b c d" before "b c" (as the start increases), but has to cache the match for "c", which (in the sub-query "(c OR d)) occurs before the one for "b". So the combination of ordering and lazy is where we get problems. The vigna paper is now quite clever, in that the very definition of "minimum interval" solves this problem, as the pair of start and end positions gets essentially reduced to a single number (e.g. start position), as it is not possible for one of them to increase with the other staying the same or even decreasing (as otherwise one of the matches would be contained in the other). I actually think that a possible solution would be to give up the "lazy" part instead and have some kind of caching, but that is not what this ticket is about.

> Add IntervalQuery and IntervalsSource to expose minimum interval semantics across term fields
> ---------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8196
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8196
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8196-debug.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch, LUCENE-8196.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket proposes an alternative implementation of the SpanQuery family that uses minimum-interval semantics from [http://vigna.di.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics.pdf] to implement positional queries across term-based fields.  Rather than using TermQueries to construct the interval operators, as in LUCENE-2878 or the current Spans implementation, we instead use a new IntervalsSource object, which will produce IntervalIterators over a particular segment and field.  These are constructed using various static helper methods, and can then be passed to a new IntervalQuery which will return documents that contain one or more intervals so defined.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org