Posted to dev@lucene.apache.org by "Paul Elschot (JIRA)" <ji...@apache.org> on 2015/11/11 22:31:10 UTC

[jira] [Updated] (LUCENE-6894) Improve DISI.cost() by assuming independence for match probabilities

     [ https://issues.apache.org/jira/browse/LUCENE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Elschot updated LUCENE-6894:
---------------------------------
    Attachment: LUCENE-6894.patch

Patch of 11 Nov 2015.
Most of the changes are to pass numDocs down to where it is actually used:
ConjunctionDISI, DisjunctionDISIApproximation, DisjunctionScorer, ConjunctionSpans, SpanOrQuery.


This is incomplete: there are no tests.
MinShouldMatchSumScorer only has the disjunctions done.
For NearSpans with zero allowed slop, the patch divides by 4 (unordered) or by 8 (ordered). Something like this should also be done for PhraseQueries.
SpanContaining and SpanWithin use the conjunction estimate; their estimates should also be smaller.


> Improve DISI.cost() by assuming independence for match probabilities
> --------------------------------------------------------------------
>
>                 Key: LUCENE-6894
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6894
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Paul Elschot
>            Priority: Minor
>         Attachments: LUCENE-6894.patch
>
>
> The DocIdSetIterator.cost() method returns an estimate of the number of matching docs. Currently conjunctions use the minimum cost and disjunctions use the sum of the costs; both are too high.
> The probability of a match is estimated by dividing the available cost() by the number of docs in a segment.
> The conjunction probability is then the product of the inputs, and the disjunction probability follows from De Morgan's rule:
> "not (A and B)" is the same as "(not A) or (not B)"
> with the probability for "not" computed as 1 minus the input probability.
> The independence that is assumed is normally not there. However, for cost() computations only an ordering of the input DISIs/Scorers is needed, and for that I expect this assumption to work nicely.
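
The estimation described in the issue can be sketched roughly as follows. This is only an illustration of the arithmetic, not the code in LUCENE-6894.patch: the class name CostEstimateSketch and its methods are hypothetical, and the clamping and rounding choices are assumptions made for the example.

```java
// Hypothetical sketch of the cost arithmetic described in the issue.
// Not taken from LUCENE-6894.patch; all names here are made up.
final class CostEstimateSketch {
    private CostEstimateSketch() {}

    /** Match probability of one iterator: cost / numDocs, clamped to [0, 1]. */
    static double matchProbability(long cost, long numDocs) {
        if (numDocs <= 0) {
            return 0.0; // assumption: an empty segment matches nothing
        }
        return Math.min(1.0, Math.max(0.0, (double) cost / numDocs));
    }

    /** Conjunction: product of the input probabilities, scaled back to docs. */
    static long conjunctionCost(long[] costs, long numDocs) {
        double p = 1.0;
        for (long cost : costs) {
            p *= matchProbability(cost, numDocs);
        }
        return Math.round(p * numDocs);
    }

    /**
     * Disjunction via De Morgan: P(A or B) = 1 - P(not A) * P(not B),
     * with P(not X) = 1 - P(X).
     */
    static long disjunctionCost(long[] costs, long numDocs) {
        double pNoMatch = 1.0;
        for (long cost : costs) {
            pNoMatch *= 1.0 - matchProbability(cost, numDocs);
        }
        return Math.round((1.0 - pNoMatch) * numDocs);
    }
}
```

For two inputs with costs 100 and 200 in a segment of 1000 docs, this sketch gives 20 for the conjunction and 280 for the disjunction, where the minimum and the sum would give 100 and 300.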


