You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "mikemccand (via GitHub)" <gi...@apache.org> on 2023/08/02 19:52:10 UTC

[GitHub] [lucene] mikemccand opened a new issue, #12486: Explore within-block skipping for postings

mikemccand opened a new issue, #12486:
URL: https://github.com/apache/lucene/issues/12486

   ### Description
   
   One of the differences between Tantivy and Lucene is that Tantivy supports within-block skipping (branchless binary search to locate a target docid inside a block of 128 postings), but Lucene skips only to block boundaries, and then does a linear scan within.  Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?
   
   Anyway, we could explore within-block skipping as well -- it might make conjunctive queries where both terms have roughly the same cardinality, quite a bit faster?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [I] Explore within-block skipping for postings [lucene]

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on issue #12486:
URL: https://github.com/apache/lucene/issues/12486#issuecomment-1787160140

   > Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?
   
   I believe Tantivy does the same, except that it can take advantage of SIMD to accumulate docid deltas into the absolute docid (if it did not accumulate deltas up-front, it could not run a branchless binary search later on). I tried to look into whether we can do the same with Panama, but last time I checked it doesn't give ways to use `_mm_slli_si128`, which prevents us from making a faster prefix sum through vectorization: https://github.com/jpountz/vectorized-prefix-sum.
   
   For reference, there is also this old @mkhludnev idea about encoding dense postings lists as bitsets, which would naturally help with skipping: #6116 (or can we do it on a per-block basis?). And more generally, there are some formats that are better at skipping within blocks like Elias-Fano.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org