You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Marvin Humphrey (JIRA)" <ji...@apache.org> on 2009/01/08 01:00:45 UTC

[jira] Commented: (LUCENE-1476) BitVector implement DocIdSet

    [ https://issues.apache.org/jira/browse/LUCENE-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661786#action_12661786 ] 

Marvin Humphrey commented on LUCENE-1476:
-----------------------------------------

While converting over PostingList (the Lucy/KS analogue to TermDocs) to use a deletions iterator, it occurred to me that the because the iterator has to keep state, a unique DelEnum object has to be created for every PostingList. In contrast, a BitVector object, which is accessed only via get(), can be shared.

It bugs me that each PostingList will have its own DelEnum performing interleaving of tombstones.  With very large queries against indexes with large numbers of deletions, it seems like a lot of duplicated work.

To minimize CPU cycles, it would theoretically make more sense to handle deletions much higher up, at the top level Scorer, Searcher, or even the HitCollector level.  PostingList would be completely ignorant of deletions, as would classes like NOTScorer and MatchAllScorer:

{code}
i32_t
MatchAllScorer_next(MatchAllScorer* self) 
{
    if (++self->doc_num > self->max_docs) {
        self->doc_num--;
        return 0;
    }
    return self->doc_num;
}
{code}

{code}
void
Scorer_collect(Scorer *self, Hitcollector *collector, DelEnum *del_enum)
{
    i32_t next_deletion = del_enum ? 0 : I32_MAX;
    i32_t doc_num = 1;
    while (1) {
        while (doc_num >= next_deletion) {
            next_deletion = DelEnum_Skip_To(del_enum, target);
            while (doc_num == next_deletion) {
                doc_num++;
                next_deletion = DelEnum_Skip_To(del_enum, doc_num);
            }
        }
        doc_num = Scorer_Skip_To(scorer, doc_num);
        if (doc_num) {
            HC_Collect(collector, doc_num, Scorer_Tally(scorer));
        }
        else { 
            break; 
        }
    }
}
{code}

The problem is that PostingLists spun off from indexReader.postingList(field, term) would include deleted documents, as would Scorers.

I suppose you could band-aid indexReader.postingList() by adding a boolean suppressDeletions argument which would default to true, but that seems messy.

Jason, I think the inefficiency of needing individual iterator objects applies to OpenBitSet as well, right?  As soon as we do away with random access, we have to keep state.  Dunno if it's going to be noticeable, but it's conceptually annoying.

> BitVector implement DocIdSet
> ----------------------------
>
>                 Key: LUCENE-1476
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1476
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.4
>            Reporter: Jason Rutherglen
>            Priority: Trivial
>         Attachments: LUCENE-1476.patch
>
>   Original Estimate: 12h
>  Remaining Estimate: 12h
>
> BitVector can implement DocIdSet.  This is for making SegmentReader.deletedDocs pluggable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org