You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2009/11/21 20:00:39 UTC

[jira] Commented: (LUCENE-2082) Performance improvement for merging posting lists

    [ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781027#action_12781027 ] 

Michael McCandless commented on LUCENE-2082:
--------------------------------------------

This sounds like a neat idea!  For building up a new index (no deletions) it ought to be a sizable performance gain.

I just committed changes on the flex branch to make it possible for codecs to override merging...

> Performance improvement for merging posting lists
> -------------------------------------------------
>
>                 Key: LUCENE-2082
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2082
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>
> A while ago I had an idea about how to improve the merge performance
> for posting lists. This is currently by far the most expensive part of
> segment merging due to all the VInt de-/encoding. Not sure if an idea
> for improving this was already mentioned in the past?
> So the basic idea is it to perform a raw copy of as much posting data
> as possible. The reason why this is difficult is that we have to
> remove deleted documents. But often the fraction of deleted docs in a
> segment is rather low (<10%?), so it's likely that there are quite
> long consecutive sections without any deletions.
> To find these sections we could use the skip lists. Basically at any
> point during the merge we would find the skip entry before the next
> deleted doc. All entries to this point can be copied without
> de-/encoding of the VInts. Then for the section that has deleted docs
> we perform the "normal" way of merging to remove the deletes. Then we
> check again with the skip lists if we can raw copy the next section.
> To make this work there are a few different necessary changes:
> 1) Currently the multilevel skiplist reader/writer can only deal with fixed-size
> skips (16 on the lowest level). It would be an easy change to allow
> variable-size skips, but then the MultiLevelSkipListReader can't
> return numSkippedDocs anymore, which SegmentTermDocs needs -> change 2)
> 2) Store the last docID in which a term occurred in the term
> dictionary. This would also be beneficial for other use cases. By
> doing that the SegmentTermDocs#next(), #read() and #skipTo() know when
> the end of the postinglist is reached. Currently they have to track
> the df, which is why after a skip it's important to take the
> numSkippedDocs into account.
> 3) Change the merging algorithm according to my description above. It's
> important to create a new skiplist entry at the beginning of every
> block that is copied in raw mode, because its next skip entry's values
> are deltas from the beginning of the block. Also the very first posting, and
> that one only, needs to be decoded/encoded to make sure that the
> payload length is explicitly written (i.e. must not depend on the
> previous length). Also such a skip entry has to be created at the
> beginning of each source segment's posting list. With change 2) we don't
> have to worry about the positions of the skip entries. And having a few
> extra skip entries in merged segments won't hurt much.
> If a segment has no deletions at all this will avoid any
> decoding/encoding of VInts (best case). I think it will also work
> great for segments with a rather low amount of deletions. We should
> probably then have a threshold: if the number of deletes exceeds this
> threshold we should fall back to old style merging.
> I haven't implemented any of this, so there might be complications I
> haven't thought about. Please let me know if you can think of reasons
> why this wouldn't work or if you think more changes are necessary.
> I will probably not have time to work on this soon, but I wanted to
> open this issue to not forget about it :). Anyone should feel free to
> take this!
> Btw: I think the flex-indexing branch would be a great place to try this
> out as a new codec. This would also be good to figure out what APIs
> are needed to make merging fully flexible as well.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org