You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2007/04/21 04:23:15 UTC
[jira] Commented: (LUCENE-866) Multi-level skipping on posting lists

    [ https://issues.apache.org/jira/browse/LUCENE-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490495 ] 

Michael Busch commented on LUCENE-866:
--------------------------------------

I ran a few performance tests with this patch. I built a rather big 
index (about 1.2GB) using documents from Wikipedia and optimized the
index to get big posting lists.

I compare query evaluation time with one-level skipping (old) and
multi-level skipping (new) and measure the amount of I/O by counting 
the number of VInts read from disk. Each query runs several thousand
times.

The following queries contain very frequent and very unique terms. 
For these queries the speedup with multi-level skipping is 
significant:

   Query: +lucene +search +engine +library
   5 total matching documents
   VInt reads: old: 143268000, new: 3933000, -97.25479520897898%
   Time: old: 7234ms, new: 1157ms, -84.00608238871993%

   Query: +apache +http +server
   181 total matching documents
   VInt reads: old: 155892000, new: 27849000, -82.13570933723346%
   Time: old: 10656ms, new: 5703ms, -46.48085585585586%


Even though I/O is reduced for the next query, it runs a bit slower.
I believe the reason is that the same query runs several thousand
times, so the posting lists will be loaded into the file system
cache and the effect of less I/O is reduced, while the skipping
algorithm itself is a bit more complex:

   Query: +multi +level +skipping
   13 total matching documents
   VInt reads: old: 42894000, new: 39096000, -8.854385228703315%
   Time: old: 3875ms, new: 3922ms, 1.2129032258064516%


For the next query there is slightly more I/O necessary in the 
multi-skipping case. This is because not many big skips can be made,
but more skipping data has to be read. The top 2 skip levels are 
buffered in this first version of the patch and if no big skips
can be made than this buffering is overhead compared to the 
single-level case:

   Query: +beatles +yellow +submarine
   78 total matching documents
   VInt reads: old: 38460000, new: 38685000, 0.5850234009360374%
   Time: old: 3172ms, new: 3265ms, 2.9319041614123584%

However, if I change the query a little bit, then the speed-up
is significant (due to the very frequent stop word "the"):

   Query: +"the beatles" +yellow +submarine
   77 total matching documents
   VInt reads: old: 382307000, new: 262331000, -31.38210914265237%
   Time: old: 26703ms, new: 22828ms, -14.51147811107366%
   
   
It would be interesting to run more sophisticated benchmarks.
To run the same query several times is not very realistic 
because it reduces the effects of the I/O savings due to caching.

I'm not that familiar with the new benchmark stuff that has
been added recently, but I'll try to dig into that next week.

> Multi-level skipping on posting lists
> -------------------------------------
>
>                 Key: LUCENE-866
>                 URL: https://issues.apache.org/jira/browse/LUCENE-866
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Michael Busch
>            Priority: Minor
>         Attachments: lucene-866.patch
>
>
> To accelerate posting list skips (TermDocs.skipTo(int)) Lucene uses skip lists. 
> The default skip interval is set to 16. If we want to skip e. g. 100 documents, 
> then it is not necessary to read 100 entries from the posting list, but only 
> 100/16 = 6 skip list entries plus 100%16 = 4 entries from the posting list. This 
> speeds up conjunction (AND) and phrase queries significantly.
> However, the skip interval is always a compromise. If you have a very big index 
> with huge posting lists and you want to skip over lets say 100k documents, then 
> it is still necessary to read 100k/16 = 6250 entries from the skip list. For big 
> indexes the skip interval could be set to a higher value, but then after a big 
> skip a long scan to the target doc might be necessary.
> A solution for this compromise is to have multi-level skip lists that guarantee a 
> logarithmic amount of skips to any target in the posting list. This patch 
> implements such an approach in the following way:
>   Example for skipInterval = 3:
>                                                       c            (skip level 2)
>                   c                 c                 c            (skip level 1) 
>       x     x     x     x     x     x     x     x     x     x      (skip level 0)
>   d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d d  (posting list)
>       3     6     9     12    15    18    21    24    27    30     (df)
>  
>   d - document
>   x - skip data
>   c - skip data with child pointer
>  
> Skip level i contains every skipInterval-th entry from skip level i-1. Therefore the 
> number of entries on level i is: floor(df / ((skipInterval ^ (i + 1))).
>  
> Each skip entry on a level i>0 contains a pointer to the corresponding skip entry in 
> list i-1. This guarantees a logarithmic amount of skips to find the target document.
> Implementations details:
>    * I factored the skipping code out of SegmentMerger and SegmentTermDocs to 
>      simplify those classes. The two new classes AbstractSkipListReader and 
> 	 AbstractSkipListWriter implement the skipping functionality.
>    * While AbstractSkipListReader and Writer take care of writing and reading the 
>      multiple skip levels, they do not implement an actual skip data format. The two 
> 	 new subclasses DefaultSkipListReader and Writer implement the skip data format 
> 	 that is currently used in Lucene (with two file pointers for the freq and prox 
> 	 file and with payload length information). I added this extra layer to be 
> 	 prepared for flexible indexing and different posting list formats. 
>       
>    
> File format changes: 
>    * I added the new parameter 'maxSkipLevels' to the term dictionary and increased the
>      version of this file. If maxSkipLevels is set to one, then the format of the freq 
> 	 file does not change at all, because we only have one skip level as before. For 
> 	 backwards compatibility maxSkipLevels is set to one automatically if an index 
> 	 without the new parameter is read. 
>    * In case maxSkipLevels > 1, then the frq file changes as follows:
>      FreqFile (.frq) --> <TermFreqs, SkipData>^TermCount
> 	 SkipData        --> <<SkipLevelLength, SkipLevel>^(Min(maxSkipLevels, 
> 	                       floor(log(DocFreq/log(skipInterval))) - 1)>, SkipLevel>
> 	 SkipLevel       --> <SkipDatum>^DocFreq/(SkipInterval^(Level + 1))
> 	 Remark: The length of the SkipLevel is not stored for level 0, because 1) it is not 
> 	 needed, and 2) the format of this file does not change for maxSkipLevels=1 then.
> 	 
> 	 
> All unit tests pass with this patch.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org