Posted to dev@lucene.apache.org by "Paul Elschot (JIRA)" <ji...@apache.org> on 2013/07/14 17:02:49 UTC

[jira] [Comment Edited] (LUCENE-5101) make it easier to plugin different bitset implementations to CachingWrapperFilter

    [ https://issues.apache.org/jira/browse/LUCENE-5101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708046#comment-13708046 ] 

Paul Elschot edited comment on LUCENE-5101 at 7/14/13 3:02 PM:
---------------------------------------------------------------

bq. I wrote a benchmark (attached) to see how they compared to FixedBitSet, you can look at the results here: http://people.apache.org/~jpountz/doc_id_sets.html 

Thank you very much, plenty of dilemmas ahead. Do WAH8 and PFOR already have an index?
With an index, each of these, including Elias-Fano, should have about constant access time when advancing far enough. What that constant time will be is still open.

Block decoding might still be added to EliasFano, which should improve its nextDoc() performance, but I have no idea by how much. See also LUCENE-2750 for Kamikaze PFOR.
The Elias-Fano code is not tuned yet, so I'm surprised that the Elias-Fano time for nextDoc() is less than a factor of two worse than PFOR.

Another surprise is that Elias-Fano is best at advance() among the compressed sets in some cases. That suggests that Long.bitCount() is doing well on the upper bits.
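
To illustrate the Long.bitCount() remark: the sketch below is not the code from the patch, only a minimal standalone illustration of the idea. It assumes the usual Elias-Fano layout in which the high part of value i is unary coded by setting bit (high[i] + i), so the number of zero bits before a set bit equals that value's high part. The class and method names (UpperBitsSketch, encodeUpper, countBelow) are made up for this example.

```java
public class UpperBitsSketch {

  /** Unary-codes the high parts: value i sets bit (high[i] + i). high must be non-decreasing. */
  static long[] encodeUpper(int[] high) {
    int n = high.length;
    int numBits = (n == 0) ? 0 : high[n - 1] + n;
    long[] words = new long[(numBits + 63) / 64];
    for (int i = 0; i < n; i++) {
      int pos = high[i] + i;
      words[pos >>> 6] |= 1L << (pos & 63);
    }
    return words;
  }

  /** Counts the encoded values whose high part is below targetHigh. */
  static int countBelow(long[] words, int targetHigh) {
    if (targetHigh <= 0) {
      return 0;
    }
    int ones = 0;
    int zeros = 0;
    for (long word : words) {
      int onesInWord = Long.bitCount(word);
      int zerosInWord = 64 - onesInWord;
      if (zeros + zerosInWord < targetHigh) {
        // The whole word lies before the targetHigh-th zero bit,
        // so it can be skipped with a single bitCount.
        ones += onesInWord;
        zeros += zerosInWord;
        continue;
      }
      // The targetHigh-th zero bit is inside this word: finish bit by bit.
      for (int b = 0; b < 64; b++) {
        if ((word & (1L << b)) != 0) {
          ones++;
        } else if (++zeros == targetHigh) {
          return ones;
        }
      }
    }
    // Ran past the last word: every encoded value has a smaller high part.
    return ones;
  }

  public static void main(String[] args) {
    long[] words = encodeUpper(new int[] {0, 0, 1, 3, 3, 4});
    for (int t = 0; t <= 5; t++) {
      System.out.println("values with high part < " + t + ": " + countBelow(words, t));
    }
  }
}
```

The count returned is the index of the first candidate value for an advance() to a target with that high part; a tuned implementation would presumably combine this with an index and a faster in-word select.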

For bit densities > 1/2 there is a clear need for WAH8 and Elias-Fano to be able to encode the inverse set. Could that be done by a common wrapper?
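
A rough idea of what such a common wrapper might look like, as a standalone sketch only. DocIterator here is a hypothetical stand-in with just nextDoc(), not Lucene's DocIdSetIterator, and the sorted int array stands in for whatever compressed set (WAH8, Elias-Fano) stores the complement. advance() is left out.

```java
import java.util.Arrays;

public class InverseSetSketch {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Hypothetical stand-in for a doc id iterator; not Lucene's DocIdSetIterator. */
  interface DocIterator {
    int nextDoc(); // next doc id in increasing order, or NO_MORE_DOCS
  }

  /** Iterates a strictly increasing int array; stands in for a compressed set. */
  static DocIterator fromSorted(final int[] docs) {
    return new DocIterator() {
      private int i = 0;

      public int nextDoc() {
        return i < docs.length ? docs[i++] : NO_MORE_DOCS;
      }
    };
  }

  /** Yields every doc in [0, maxDoc) that the stored (complement) iterator does not yield. */
  static DocIterator inverse(final DocIterator stored, final int maxDoc) {
    return new DocIterator() {
      private int doc = -1;
      private int excluded = stored.nextDoc(); // smallest stored doc not yet passed

      public int nextDoc() {
        if (doc >= maxDoc) {
          return NO_MORE_DOCS;
        }
        doc++;
        // Step over runs of docs that are present in the stored complement.
        while (doc < maxDoc && doc == excluded) {
          doc++;
          excluded = stored.nextDoc();
        }
        return doc < maxDoc ? doc : NO_MORE_DOCS;
      }
    };
  }

  /** Drains an iterator into an array, for demonstration purposes. */
  static int[] collect(DocIterator it) {
    int[] out = new int[16];
    int n = 0;
    for (int d = it.nextDoc(); d != NO_MORE_DOCS; d = it.nextDoc()) {
      if (n == out.length) {
        out = Arrays.copyOf(out, 2 * n);
      }
      out[n++] = d;
    }
    return Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    // A dense set over 8 docs is stored as its sparse complement {3, 6}.
    DocIterator dense = inverse(fromSorted(new int[] {3, 6}), 8);
    System.out.println(Arrays.toString(collect(dense)));
  }
}
```

The wrapper only needs nextDoc() from the stored set, which is why it could sit on top of any of the implementations; whether it can be made fast enough for advance() is a separate question.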

                
                  
> make it easier to plugin different bitset implementations to CachingWrapperFilter
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-5101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5101
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>         Attachments: LUCENE-5101.patch
>
>
> Currently this is possible, but it's not so friendly:
> {code}
>   protected DocIdSet docIdSetToCache(DocIdSet docIdSet, AtomicReader reader) throws IOException {
>     if (docIdSet == null) {
>       // this is better than returning null, as the nonnull result can be cached
>       return EMPTY_DOCIDSET;
>     } else if (docIdSet.isCacheable()) {
>       return docIdSet;
>     } else {
>       final DocIdSetIterator it = docIdSet.iterator();
>       // null is allowed to be returned by iterator(),
>       // in this case we wrap with the sentinel set,
>       // which is cacheable.
>       if (it == null) {
>         return EMPTY_DOCIDSET;
>       } else {
> /* INTERESTING PART */
>         final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
>         bits.or(it);
>         return bits;
> /* END INTERESTING PART */
>       }
>     }
>   }
> {code}
> Is there any value to having all this other logic in the protected API? It seems like something that's not useful for a subclass... Maybe this stuff can become final, and "INTERESTING PART" calls a simpler method, something like:
> {code}
> protected DocIdSet cacheImpl(DocIdSetIterator iterator, AtomicReader reader) throws IOException {
>   final FixedBitSet bits = new FixedBitSet(reader.maxDoc());
>   bits.or(iterator);
>   return bits;
> }
> {code}
