Posted to dev@lucene.apache.org by "Paul Elschot (JIRA)" <ji...@apache.org> on 2015/11/12 23:27:11 UTC

[jira] [Comment Edited] (LUCENE-6894) Improve DISI.cost() by assuming independence for match probabilities

    [ https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003039#comment-15003039 ] 

Paul Elschot edited comment on LUCENE-6894 at 11/12/15 10:27 PM:
-----------------------------------------------------------------

Another reason why I started this is that the result of cost() is also used as weights for matchCost() at LUCENE-6276, and I'd prefer those weights to be as accurate as reasonably possible.

I think we can keep this approach (assuming independence for conjunctions and disjunctions) in reserve as a possible alternative until the current implementation turns out to give a bad result.

For the proximity queries (Phrases, Spans) this reduces the conjunction cost() using the allowed slop.
Would it be worthwhile to open a separate issue for that?



was (Author: paul.elschot@xs4all.nl):
Another reason why I started this is that the result of cost() is also used as weights for matchCost() at LUCENE-6276, and I'd prefer those weights to be as accurate as reasonably possible.

I think we can keep this alternative (assuming independence for conjunctions and disjunctions) as a possible alternative until the current implementation gives a bad result.

For the proximity queries (Phrases, Spans) this reduces the conjunction cost() using the allowed slop.
Would it be worthwhile to open a separate issue for that?


> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of matching docs. Currently conjunctions use the minimum cost and disjunctions use the sum of the costs, and both estimates are too high.
> The probability of a match is estimated by dividing the available cost() by the number of docs in a segment.
> The conjunction probability is then the product of the input probabilities, and the disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The assumed independence normally does not hold. However, the cost() results are only used to order the input DISIs/Scorers for optimization, and for that I expect this assumption to work nicely.
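
To make the arithmetic in the description concrete, here is a minimal sketch of the two estimates. It is not the code from LUCENE-6894.patch: the class IndependentCostSketch and its method names are hypothetical, and the clamping to [0, 1] and the rounding are assumptions made only for this illustration. Each input probability is p_i = cost_i / maxDoc; the conjunction uses the product of the p_i, and the disjunction uses 1 minus the product of (1 - p_i), which is the De Morgan form.

```java
public final class IndependentCostSketch {

    private IndependentCostSketch() {}

    /** Estimated match probability of one input: cost / maxDoc, clamped to [0, 1] (assumption). */
    public static double matchProbability(long cost, int maxDoc) {
        if (maxDoc <= 0) {
            return 0.0;
        }
        return Math.min(1.0, Math.max(0.0, (double) cost / maxDoc));
    }

    /** Conjunction: P(A and B and ...) = product of the input probabilities. */
    public static long conjunctionCost(long[] costs, int maxDoc) {
        double pAll = 1.0;
        for (long cost : costs) {
            pAll *= matchProbability(cost, maxDoc);
        }
        return Math.round(pAll * maxDoc);
    }

    /** Disjunction via De Morgan: P(A or B or ...) = 1 - product of (1 - p_i). */
    public static long disjunctionCost(long[] costs, int maxDoc) {
        double pNone = 1.0;
        for (long cost : costs) {
            pNone *= 1.0 - matchProbability(cost, maxDoc);
        }
        return Math.round((1.0 - pNone) * maxDoc);
    }

    public static void main(String[] args) {
        long[] costs = {100, 200}; // cost() of two inputs
        int maxDoc = 1000;         // number of docs in the segment
        System.out.println(conjunctionCost(costs, maxDoc)); // prints 20
        System.out.println(disjunctionCost(costs, maxDoc)); // prints 280
    }
}
```

For these inputs the current approach described above would give 100 for the conjunction (the minimum) and 300 for the disjunction (the sum), so both independence-based estimates come out lower.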


