Posted to dev@lucene.apache.org by "Adrien Grand (JIRA)" <ji...@apache.org> on 2015/06/23 19:46:00 UTC

[jira] [Updated] (LUCENE-6553) Simplify how we handle deleted docs in read APIs

     [ https://issues.apache.org/jira/browse/LUCENE-6553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-6553:
---------------------------------
    Attachment: LUCENE-6553.patch

Here is a patch that removes the handling of acceptDocs from the postings, spans and scorer APIs, and moves it from the constructor to the score() method for BulkScorer.

In general I think it simplifies the code a lot:
 - we have lots of postings formats and query impls that do not need to care about deleted docs at all anymore since they use the default bulk scorer
 - CheckIndex does not need to test that postings formats ignore deleted docs correctly

One thing I am unsure about is whether LeafReader.postings should still apply deleted docs. At other call sites, removing the acceptDocs parameter causes a compilation error, but this method never had such a parameter and implicitly applied the reader's live docs, so its behavior changes silently. For now I documented explicitly that live docs are not applied, but I could also understand why someone would like to see live docs applied for this method. I decided not to apply them because otherwise, if you used this method in a Query implementation, the Scorer would be illegal since it would apply live docs while it is not supposed to.
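
For callers who still want deletions filtered out, here is a rough sketch of applying live docs on top of a raw doc-id iterator on the caller side. It is not part of the patch: DocIterator, filterLive and over are hypothetical stand-ins, not Lucene classes, and an IntPredicate plays the role of the reader's live docs.

```java
import java.util.function.IntPredicate;

// Hypothetical, simplified stand-ins for illustration; NOT Lucene's actual API.
final class LiveDocsFilterSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal doc-id iterator: returns doc ids in increasing order, then NO_MORE_DOCS. */
    interface DocIterator {
        int nextDoc();
    }

    /** Wraps a raw iterator so that deleted docs are skipped; a null liveDocs means no deletions. */
    static DocIterator filterLive(DocIterator raw, IntPredicate liveDocs) {
        if (liveDocs == null) {
            return raw;
        }
        return () -> {
            int doc = raw.nextDoc();
            // Keep advancing until we hit a live document or exhaust the iterator.
            while (doc != NO_MORE_DOCS && !liveDocs.test(doc)) {
                doc = raw.nextDoc();
            }
            return doc;
        };
    }

    /** Test helper: an iterator over a fixed, sorted list of doc ids. */
    static DocIterator over(int... docs) {
        int[] pos = {0};
        return () -> pos[0] < docs.length ? docs[pos[0]++] : NO_MORE_DOCS;
    }

    public static void main(String[] args) {
        // Doc 2 is treated as deleted in this toy example.
        DocIterator it = filterLive(over(0, 2, 5), doc -> doc != 2);
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println(doc);
        }
    }
}
```

A Scorer built from the raw iterator would skip this wrapper, which is the point of leaving live docs out of the method itself.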

> Simplify how we handle deleted docs in read APIs
> ------------------------------------------------
>
>                 Key: LUCENE-6553
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6553
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>            Priority: Minor
>             Fix For: Trunk
>
>         Attachments: LUCENE-6553.patch
>
>
> Today, all scorers and postings formats need to be able to handle deleted documents.
> I suspect the reason is that we want to avoid performing costly operations on deleted documents. For instance, if you run a phrase query, reading positions on a deleted document is useless. I suspect this is also a source of inefficiencies, since in some cases we apply deleted documents several times: for instance, conjunctions apply deleted docs to every sub scorer.
> However, with the new two-phase iteration API, we have a way to make sure that we never run expensive operations on deleted documents: we could first iterate over the approximation, then check that the document is not deleted, and finally confirm the match. Since approximations are cheap, applying deleted docs after them would not be an issue.
> I would like to explore removing the "Bits acceptDocs" parameter from TermsEnum.postings, Weight.scorer, SpanWeight.getSpans and Weight.bulkScorer, and adding it to BulkScorer.score. This way, bulk scorers would be the only API which would need to know how to apply deleted docs, which I think would be more manageable since we only have 3 or 4 impls. And DefaultBulkScorer would be implemented the way described above: first advance the approximation, then check deleted docs, then confirm the match, then collect. Of course that only applies when the scorer supports approximations. If it does not, the scorer is cheap, so we can iterate it directly and check deleted docs on top.
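
The order of operations described for DefaultBulkScorer can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code from the patch: Approximation, score and over are hypothetical stand-ins rather than Lucene classes, an IntPredicate plays the role of both the live docs and the expensive match confirmation, and a list stands in for the collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical, simplified stand-ins for illustration; NOT Lucene's actual classes.
final class TwoPhaseSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Cheap approximation: returns candidate doc ids in increasing order, then NO_MORE_DOCS. */
    interface Approximation {
        int nextDoc();
    }

    /** Scores in the order: advance approximation, check live docs, confirm match, collect. */
    static List<Integer> score(Approximation approximation,
                               IntPredicate matches,    // expensive confirmation step
                               IntPredicate liveDocs) { // null means no deletions
        List<Integer> collected = new ArrayList<>();
        for (int doc = approximation.nextDoc(); doc != NO_MORE_DOCS; doc = approximation.nextDoc()) {
            if (liveDocs != null && !liveDocs.test(doc)) {
                continue; // deleted: skip before paying for the expensive confirmation
            }
            if (matches.test(doc)) {
                collected.add(doc); // stands in for the collector
            }
        }
        return collected;
    }

    /** Test helper: an approximation over a fixed, sorted list of candidate doc ids. */
    static Approximation over(int... docs) {
        return new Approximation() {
            private int i = 0;

            @Override
            public int nextDoc() {
                return i < docs.length ? docs[i++] : NO_MORE_DOCS;
            }
        };
    }

    public static void main(String[] args) {
        List<Integer> confirmed = new ArrayList<>();
        // Doc 3 is treated as deleted; doc 7 fails the expensive confirmation.
        List<Integer> hits = score(
            over(1, 3, 4, 7),
            doc -> { confirmed.add(doc); return doc != 7; },
            doc -> doc != 3);
        System.out.println("collected=" + hits + " confirmed=" + confirmed);
    }
}
```

In this toy run the confirmation step is never invoked for the deleted document, so the expensive work is only paid for live candidates.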



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
