Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2009/11/18 22:52:39 UTC

[jira] Created: (LUCENE-2082) Performance improvement for merging posting lists

Performance improvement for merging posting lists
-------------------------------------------------

                 Key: LUCENE-2082
                 URL: https://issues.apache.org/jira/browse/LUCENE-2082
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael Busch
            Priority: Minor
             Fix For: 3.1


A while ago I had an idea about how to improve the merge performance
for posting lists. This is currently by far the most expensive part of
segment merging, due to all the VInt de-/encoding. I'm not sure whether
an idea for improving this has already been mentioned in the past?

So the basic idea is to perform a raw copy of as much posting data
as possible. The reason this is difficult is that we have to
remove deleted documents. But often the fraction of deleted docs in a
segment is rather low (<10%?), so it's likely that there are quite
long consecutive sections without any deletions.

To find these sections we could use the skip lists. Basically, at any
point during the merge we would find the last skip entry before the
next deleted doc. All postings up to that point can be copied without
de-/encoding of the VInts. Then, for the section that contains deleted
docs, we perform the "normal" way of merging to remove the deletes.
Afterwards we check again with the skip lists whether we can raw-copy
the next section.
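To illustrate the loop, here is a rough sketch using a deliberately
simplified model: a plain int[] stands in for the encoded posting list,
a BitSet for the deleted docs, and a fixed block size for the lowest
skip level. All names here are hypothetical, not Lucene APIs:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: copy skip-interval-sized blocks "raw" when they
// contain no deletions; fall back to per-doc merging otherwise.
public class SkipCopyMergeSketch {

    // docIDs: the segment's posting list; deleted: deletion flags indexed
    // by docID; skipInterval: docs per lowest-level skip entry.
    public static int[] merge(int[] docIDs, BitSet deleted, int skipInterval) {
        List<Integer> out = new ArrayList<>();
        for (int start = 0; start < docIDs.length; start += skipInterval) {
            int end = Math.min(start + skipInterval, docIDs.length);
            boolean blockHasDeletes = false;
            for (int i = start; i < end; i++) {
                if (deleted.get(docIDs[i])) { blockHasDeletes = true; break; }
            }
            if (!blockHasDeletes) {
                // "Raw copy": in the real index this would be a byte-level
                // copy of the VInt-encoded block, with no decoding at all.
                for (int i = start; i < end; i++) out.add(docIDs[i]);
            } else {
                // "Normal" merge: decode each posting and drop deletions.
                for (int i = start; i < end; i++) {
                    if (!deleted.get(docIDs[i])) out.add(docIDs[i]);
                }
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }
}
```

In the real implementation the clean blocks would of course be copied at
the byte level without materializing individual docIDs at all; the point
of the sketch is only the block-by-block decision.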

To make this work, a few changes are necessary:

1) Currently the multilevel skiplist reader/writer can only deal with fixed-size
skips (16 on the lowest level). It would be an easy change to allow
variable-size skips, but then the MultiLevelSkipListReader can't
return numSkippedDocs anymore, which SegmentTermDocs needs -> change 2)

2) Store the last docID in which a term occurs in the term
dictionary. This would also be beneficial for other use cases. With
that change, SegmentTermDocs#next(), #read() and #skipTo() know when
the end of the posting list is reached. Currently they have to track
the df, which is why after a skip it's important to take
numSkippedDocs into account.
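A minimal sketch of what change 2) buys us (hypothetical class, not the
real SegmentTermDocs): the enumerator terminates by comparing against
the stored last docID, so no df countdown or numSkippedDocs bookkeeping
is needed after a skip.

```java
// Hypothetical sketch of change 2): the posting enumerator stops once it
// has returned the term's last docID (which would be stored in the term
// dictionary), instead of counting down the document frequency (df).
public class LastDocTermDocsSketch {
    private final int[] postings;   // stand-in for the encoded posting list
    private final int lastDocID;    // would come from the term dictionary
    private int pos = -1;

    public LastDocTermDocsSketch(int[] postings, int lastDocID) {
        this.postings = postings;
        this.lastDocID = lastDocID;
    }

    // Returns the next docID, or -1 when the list is exhausted. No df
    // counter is needed: we are done as soon as the previous doc was the
    // last one, even if an arbitrary-size skip landed us mid-list.
    public int next() {
        if (pos >= 0 && postings[pos] == lastDocID) return -1;
        return postings[++pos];
    }
}
```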

3) Change the merging algorithm as described above. It's important to
create a new skiplist entry at the beginning of every block that is
copied in raw mode, because the next skip entry's values are deltas
from the beginning of the block. Also, the very first posting, and
that one only, needs to be decoded and re-encoded to make sure that the
payload length is written explicitly (i.e. it must not depend on the
previous length). Such a skip entry also has to be created at the
beginning of each source segment's posting list. With change 2) we don't
have to worry about the positions of the skip entries, and having a few
extra skip entries in merged segments won't hurt much.


If a segment has no deletions at all, this avoids any
decoding/encoding of VInts (the best case). I think it will also work
well for segments with a rather low number of deletions. We should
probably then have a threshold: if the number of deletes exceeds it,
we fall back to old-style merging.
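The fallback heuristic itself would be trivial; a sketch (the 10% cutoff
is an arbitrary illustration taken from the guess above, not a tuned
value, and the names are hypothetical):

```java
// Hypothetical sketch: only attempt raw-copy merging when the segment's
// deletion ratio is below some threshold; otherwise use the existing
// decode/re-encode merge path.
public class MergeStrategySketch {
    static final double MAX_DELETE_RATIO = 0.10;  // arbitrary illustration

    public static boolean useRawCopyMerge(int numDocs, int numDeletedDocs) {
        if (numDocs == 0) return true;  // nothing to decode anyway
        return (double) numDeletedDocs / numDocs < MAX_DELETE_RATIO;
    }
}
```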

I haven't implemented any of this, so there might be complications I
haven't thought about. Please let me know if you can think of reasons
why this wouldn't work or if you think more changes are necessary.

I will probably not have time to work on this soon, but I wanted to
open this issue to not forget about it :). Anyone should feel free to
take this!

Btw: I think the flex-indexing branch would be a great place to try this
out as a new codec. This would also be good to figure out what APIs
are needed to make merging fully flexible as well.







-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2082) Performance improvement for merging posting lists

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781027#action_12781027 ] 

Michael McCandless commented on LUCENE-2082:
--------------------------------------------

This sounds like a neat idea!  For building up a new index (no deletions) it ought to be a sizable performance gain.

I just committed changes on the flex branch to make it possible for codecs to override merging...

