Posted to dev@lucene.apache.org by "Doron Cohen (JIRA)" <ji...@apache.org> on 2011/05/19 11:13:52 UTC

[jira] [Created] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

span query matches too many docs when two query terms are the same unless inOrder=true
--------------------------------------------------------------------------------------

                 Key: LUCENE-3120
                 URL: https://issues.apache.org/jira/browse/LUCENE-3120
             Project: Lucene - Java
          Issue Type: Bug
          Components: core/search
            Reporter: Doron Cohen
            Priority: Minor
             Fix For: 3.2, 4.0


Spinoff of a user list discussion - [SpanNearQuery - inOrder parameter|http://markmail.org/message/i4cstlwgjmlcfwlc].

With 3 documents:
*  "a b x c d"
*  "a b b d"
*  "a b x b y d"

Here are a few queries (the number in parentheses indicates the expected number of hits):


These ones work *as expected*:
* (1)  in-order, slop=0, "b", "x", "b"
* (1)  in-order, slop=0, "b", "b"
* (2)  in-order, slop=1, "b", "b"

These ones match *too many* hits:
* (1)  any-order, slop=0, "b", "x", "b"
* (1)  any-order, slop=1, "b", "x", "b"
* (1)  any-order, slop=2, "b", "x", "b"
* (1)  any-order, slop=3, "b", "x", "b"

These ones match *too many* hits as well:
* (1)  any-order, slop=0, "b", "b"
* (2)  any-order, slop=1, "b", "b"

Each of the above passes when using a phrase query (applying the slop, no in-order indication in phrase query).

This seems related to a known overlapping spans issue - [non-overlapping Span queries|http://markmail.org/message/7jxn5eysjagjwlon] - as indicated by Hoss, so we might decide to close this bug after all, but I would like to at least have the JUnit test that exposes the behavior in JIRA.
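
The counts above can be reproduced with a small stand-alone model. This is only a sketch and not Lucene's code: the class and method names are made up for illustration, documents are split on spaces, and the slop test (max end minus min start minus number of clauses) is an assumption about how unordered near matching measures distance. With distinct=false each clause picks its position independently, so one "b" can serve two clauses; with distinct=true a position may be used only once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of unordered span-near matching; NOT Lucene's implementation. */
final class UnorderedNearModel {

    /** Positions at which a term occurs in a space-separated document. */
    static List<Integer> positions(String doc, String term) {
        List<Integer> out = new ArrayList<>();
        String[] tokens = doc.split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(term)) out.add(i);
        }
        return out;
    }

    /** True if some choice of one position per clause fits within the slop. */
    static boolean matches(String doc, String[] terms, int slop, boolean distinct) {
        return search(doc, terms, slop, distinct, new int[terms.length], 0);
    }

    private static boolean search(String doc, String[] terms, int slop,
                                  boolean distinct, int[] chosen, int i) {
        if (i == terms.length) {
            int min = Arrays.stream(chosen).min().getAsInt();
            int max = Arrays.stream(chosen).max().getAsInt();
            // Each term span has length 1, so its end is position + 1.
            return (max + 1) - min - terms.length <= slop;
        }
        for (int p : positions(doc, terms[i])) {
            boolean reused = false;
            for (int j = 0; j < i; j++) reused |= chosen[j] == p;
            if (distinct && reused) continue; // forbid sharing a position
            chosen[i] = p;
            if (search(doc, terms, slop, distinct, chosen, i + 1)) return true;
        }
        return false;
    }

    /** Number of documents matched by the any-order query. */
    static int countHits(String[] docs, String[] terms, int slop, boolean distinct) {
        int hits = 0;
        for (String d : docs) {
            if (matches(d, terms, slop, distinct)) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] docs = {"a b x c d", "a b b d", "a b x b y d"};
        String[] bxb = {"b", "x", "b"};
        String[] bb = {"b", "b"};
        // Shared positions allowed: the model over-matches.
        System.out.println(countHits(docs, bxb, 0, false)); // 2
        System.out.println(countHits(docs, bb, 0, false));  // 3
        // Distinct positions required: the expected counts come back.
        System.out.println(countHits(docs, bxb, 0, true));  // 1
        System.out.println(countHits(docs, bb, 0, true));   // 1
        System.out.println(countHits(docs, bb, 1, true));   // 2
    }
}
```

In this model, requiring distinct positions gives the same counts as the expected numbers listed for the any-order queries.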

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Updated] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3120:
--------------------------------

    Attachment: LUCENE-3120.patch

Attached test case demonstrating the bug.



[jira] [Updated] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-3120:
--------------------------------

    Attachment: LUCENE-3120.patch

Updated patch with fixed test to not depend on analysis module.



[jira] [Updated] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3120:
---------------------------------------

    Fix Version/s:     (was: 3.4)
                   3.5



[jira] [Commented] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036538#comment-13036538 ] 

Hoss Man commented on LUCENE-3120:
----------------------------------

Comment I made on the mailing list regarding this topic:

{quote}
The crux of the issue (as I recall) is that there is really no conceptual reason why a query for "'john' near 'john', in any order, with slop of Z" shouldn't match a doc that contains only one instance of "john". The first SpanTermQuery says "I found a match at position X", the second SpanTermQuery says "I found a match at position Y", and the SpanNearQuery says "the difference between X and Y is less than Z, therefore I have a match". (The SpanNearQuery can't fail just because X and Y are the same: they might be two distinct term instances, with different payloads perhaps, that just happen to have the same position.)

However, the inOrder==true case works because the SpanNearQuery enforces that "X must be less than Y", so the same term can't ever match twice.
{quote}
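
The argument in the quote can be written as two position predicates. This is a sketch under assumptions, not Lucene code: the class and method names are invented, each term is treated as a span of length 1, and slop is taken to be the number of positions allowed between X and Y.

```java
/** Toy predicates for "'john' near 'john'"; NOT Lucene's implementation. */
final class JohnNearJohn {

    /** Any-order check: only the distance between X and Y is tested. */
    static boolean unorderedNear(int x, int y, int slop) {
        return Math.abs(x - y) - 1 <= slop;
    }

    /** In-order check: X must come strictly before Y. */
    static boolean orderedNear(int x, int y, int slop) {
        return x < y && (y - x - 1) <= slop;
    }

    public static void main(String[] args) {
        // A document with a single "john" at position 0: both clauses report it.
        int x = 0;
        int y = 0;
        System.out.println(unorderedNear(x, y, 0)); // true: same position passes
        System.out.println(orderedNear(x, y, 0));   // false: X < Y is required
    }
}
```

When X equals Y the any-order predicate sees a distance of zero and accepts, while the in-order predicate rejects the pair before the slop is even considered.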



[jira] [Commented] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036540#comment-13036540 ] 

Hoss Man commented on LUCENE-3120:
----------------------------------

What we might want to consider is a new option on SpanNearQuery that would mandate that the spans not overlap.

Paul Elschot once described the general form of this idea as a numeric option to specify a minimum distance between the subspans (so the default, as implemented today, for inOrder==true would be minPositionDistance=1; and the default for inOrder==false would be minPositionDistance=0).
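
A sketch of how such an option could behave for two single-term subspans. The minPositionDistance parameter here is hypothetical, taken from the proposal above; it is not an existing SpanNearQuery option, and the class, method name, and slop arithmetic are assumptions made for illustration.

```java
/** Toy model of a minimum-distance option; NOT an existing Lucene API. */
final class MinDistanceNear {

    /**
     * x and y are the positions reported by the first and second clause.
     * minPositionDistance is the hypothetical option: 0 lets both clauses
     * share a position, 1 forces them onto different positions.
     */
    static boolean near(int x, int y, int slop, boolean inOrder, int minPositionDistance) {
        // In-order keeps the sign, so y before x yields a negative gap.
        int gap = inOrder ? y - x : Math.abs(y - x);
        return gap >= minPositionDistance && gap - 1 <= slop;
    }

    public static void main(String[] args) {
        // Defaults described above: unordered uses 0, in-order uses 1.
        System.out.println(near(0, 0, 0, false, 0)); // true: shared position accepted
        System.out.println(near(0, 0, 0, true, 1));  // false: X < Y is enforced
        // Asking for a minimum distance of 1 on the unordered query.
        System.out.println(near(0, 0, 0, false, 1)); // false: shared position rejected
        System.out.println(near(2, 1, 0, false, 1)); // true: adjacent, either order
    }
}
```

Under these assumptions, the unordered query with a minimum distance of 1 still accepts terms in either order but no longer lets one term occurrence satisfy both clauses.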






[jira] [Commented] (LUCENE-3120) span query matches too many docs when two query terms are the same unless inOrder=true

Posted by "Greg Tarr (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13036080#comment-13036080 ] 

Greg Tarr commented on LUCENE-3120:
-----------------------------------

Thanks for raising this.
