Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2011/01/23 16:08:44 UTC

[jira] Created: (LUCENE-2880) SpanQuery scoring inconsistencies

SpanQuery scoring inconsistencies
---------------------------------

                 Key: LUCENE-2880
                 URL: https://issues.apache.org/jira/browse/LUCENE-2880
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Robert Muir
             Fix For: 4.0


Spinoff of LUCENE-2879.

You can see a full description there, but the gist is that SpanQuery sums up freqs with "sloppyFreq".
However, this slop is simply spans.end() - spans.start().

For a SpanTermQuery, for example, this means it's scoring 0.5 for TF versus TermQuery's 1.0.
As you can imagine, I think in practical situations this would make it difficult for SpanQuery users to
really use SpanQueries for effective ranking, especially in combination with non-SpanQueries (maybe via DisjunctionMaxQuery, etc.).

The problem is more general than this simple example: for example, SpanNearQuery should be consistent with PhraseQuery's slop.
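
To make the 0.5 versus 1.0 figure concrete, here is a minimal sketch. It assumes the sloppy-frequency formula 1 / (distance + 1), as used by DefaultSimilarity; the class and method names are illustrative only and are not Lucene APIs.

```java
public class SloppyFreqDemo {
    // Assumed formula, mirroring DefaultSimilarity: 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // The "distance" the span scorer currently uses: the raw width of the match.
    public static int rawMatchLength(int start, int end) {
        return end - start;
    }

    public static void main(String[] args) {
        // A single-term span at position 5 covers [5, 6): start = 5, end = 6.
        int distance = rawMatchLength(5, 6);
        System.out.println("span term contribution:   " + sloppyFreq(distance)); // 0.5
        // An exact match would ideally be treated as distance 0.
        System.out.println("exact match contribution: " + sloppyFreq(0));        // 1.0
    }
}
```

The single-term span has width 1, so it is penalised as if it were a sloppy match at distance 1.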


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] Commented: (LUCENE-2880) SpanQuery scoring inconsistencies

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985407#action_12985407 ] 

Robert Muir commented on LUCENE-2880:
-------------------------------------

Paul, I agree. I think the only way it would work is to be in Spans itself,
which is the real 'Scorer' for SpanQueries, because it's wrong for SpanOrQuery
to have a getLength() really... just like it would be wrong for BooleanQuery
to know anything about phrase slops of its subqueries!

We can just leave this issue open and see what happens with
LUCENE-2878, and maybe a good solution will then be more obvious.



[jira] Commented: (LUCENE-2880) SpanQuery scoring inconsistencies

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985388#action_12985388 ] 

Paul Elschot commented on LUCENE-2880:
--------------------------------------

The getLength() method may not be straightforward.

Does the getLength() method in SpanQuery also work in the nested case, when there is a SpanOr over two SpanQueries of different lengths?

It may be necessary to add this length to Spans because of this.

Some reasons for a negative match length:
- multiple terms indexed at the same position, 
- span distance queries with the same subqueries.

I wish I had a good solution for this, but I have not found one yet.
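
A rough sketch of how the match length can go negative, under some assumptions: the patched formula (end - start) - queryLength, a sloppy-frequency formula of 1 / (distance + 1), and made-up positions. The names below are hypothetical and not part of Lucene.

```java
public class NestedLengthDemo {
    // Patched computation from the proposed change: width minus query length.
    public static int patchedMatchLength(int start, int end, int queryLength) {
        return (end - start) - queryLength;
    }

    // Assumed formula: 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Hypothetical spanOr(term, near(foo, bar)) with one query-level length of 2.
        int queryLength = 2;

        // A match from the near branch: foo at 5, bar at 6, span [5, 7).
        System.out.println(patchedMatchLength(5, 7, queryLength)); // 0

        // A match from the term branch: span [5, 6). The shared length no longer fits.
        System.out.println(patchedMatchLength(5, 6, queryLength)); // -1

        // Two terms indexed at the same position, matched by an unordered near:
        // the span is still [5, 6) even though the query length is 2.
        int negative = patchedMatchLength(5, 6, 2);

        // With the assumed formula, a distance of -1 divides by zero.
        System.out.println(sloppyFreq(negative)); // Infinity
    }
}
```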



[jira] Updated: (LUCENE-2880) SpanQuery scoring inconsistencies

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2880:
--------------------------------

    Attachment: LUCENE-2880.patch

Here's a quickly hacked-up patch (core tests pass, but I didn't go fixing contrib, etc. yet).

It's just to get ideas.

The approach I took was for SpanQuery to have a new method:
{noformat}
  /** 
   * Returns the length (number of positions) in the query.
   * <p>
   * For example, for a simple Term this is 1.
   * For a NEAR of "foo" and "bar" this is 2.
   * This is used by SpanScorer to compute the appropriate slop factor,
   * so that SpanQueries score consistently with other queries.
   */
  public abstract int getLength();
{noformat}

This is called once by the Weight, and passed to SpanScorer.

Then SpanScorer computes the slop as:
{noformat}
int matchLength = (spans.end() - spans.start()) - queryLength;
{noformat}
instead of:
{noformat}
int matchLength = spans.end() - spans.start();
{noformat}
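
For illustration, here is a small standalone comparison of the two computations. It is only a sketch: the sloppy-frequency formula 1 / (distance + 1) is assumed, the positions are made up, and none of these names come from the patch itself.

```java
public class MatchLengthDemo {
    // Assumed formula: 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    // Current behaviour: the raw width of the span.
    public static int currentMatchLength(int start, int end) {
        return end - start;
    }

    // Patched behaviour: width minus the number of positions in the query.
    public static int patchedMatchLength(int start, int end, int queryLength) {
        return (end - start) - queryLength;
    }

    public static void main(String[] args) {
        // Single term (query length 1) matching at [5, 6).
        System.out.println(sloppyFreq(currentMatchLength(5, 6)));    // 0.5
        System.out.println(sloppyFreq(patchedMatchLength(5, 6, 1))); // 1.0

        // NEAR of "foo" and "bar" (query length 2), adjacent terms, span [5, 7).
        System.out.println(sloppyFreq(patchedMatchLength(5, 7, 2))); // 1.0

        // Same NEAR with one extra position between the terms, span [5, 8).
        System.out.println(sloppyFreq(patchedMatchLength(5, 8, 2))); // 0.5
    }
}
```

With the query length subtracted, an exact match gets distance 0, which is how an exact PhraseQuery match is treated.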



[jira] Commented: (LUCENE-2880) SpanQuery scoring inconsistencies

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985347#action_12985347 ] 

Robert Muir commented on LUCENE-2880:
-------------------------------------

Thinking about this one: for this to really work correctly with the current setup (e.g. with SpanOrQuery),
this length might have to be in the Spans class...

But with LUCENE-2878 we nuke this class, so we can keep the issue open to think about how
the slop should be computed for these queries. I think just using end - start is not the best.
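
One way to picture "length in the Spans class" is that each match carries its own expected length. The sketch below is purely hypothetical: Match is an invented type, not the Lucene Spans API, and the 1 / (distance + 1) formula is assumed.

```java
public class PerMatchLengthDemo {
    /** Hypothetical per-match view; the real Spans class has no such length. */
    public static final class Match {
        final int start;
        final int end;
        final int length; // positions expected by the sub-query that produced this match

        public Match(int start, int end, int length) {
            this.start = start;
            this.end = end;
            this.length = length;
        }

        // Slop relative to the sub-query that actually matched.
        public int slop() {
            return (end - start) - length;
        }
    }

    // Assumed formula: 1 / (distance + 1).
    public static float sloppyFreq(int distance) {
        return 1.0f / (distance + 1);
    }

    public static void main(String[] args) {
        // Two branches of a hypothetical SpanOr, each reporting its own length.
        Match fromTerm = new Match(5, 6, 1); // single term
        Match fromNear = new Match(5, 7, 2); // adjacent two-term near
        Match gapped = new Match(5, 8, 2);   // near with one position in between

        System.out.println(sloppyFreq(fromTerm.slop())); // 1.0
        System.out.println(sloppyFreq(fromNear.slop())); // 1.0
        System.out.println(sloppyFreq(gapped.slop()));   // 0.5
    }
}
```

This sketch does not address terms indexed at the same position, where the width can still be smaller than the length.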



[jira] Commented: (LUCENE-2880) SpanQuery scoring inconsistencies

Posted by "Paul Elschot (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12985387#action_12985387 ] 

Paul Elschot commented on LUCENE-2880:
--------------------------------------

A related problem is that Spans does not have a weight (or whatever factor) of its own.
Currently Spans can only be scored at the top level (by SpanScorer) and not when they are nested.
In the nested case, the only way to affect the score value is via the length.
