You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/22 08:09:24 UTC

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047526553

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make `DocIdSetBuilder#grow` take a long instead of an int.

Maybe it's a javadocs issue because `DocIdSetBuilder#grow` says that it returns "a `BulkAdder` object that can be used to add up to `numDocs` documents", which might suggest that `numDocs` is the number of unique documents contributed, when in fact this number is simply an upper bound of the number of times that you may call `BulkAdder#add` on the returned `BulkAdder` object.

> I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the `long counter` that you looked at.

The point of `BulkAdder#add` is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than `Integer#MAX_VALUE` (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. `SortedSetDocValues#nextOrd` returns a long.

> in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of int array?

This is already the case today, see the `else` block in `DocIdSetBuilder#build`. The cost estimation logic only happens in the dense case when a `FixedBitSet` is used to hold the set of matching docs.

FWIW we could change the estimation logic to perform a popCount over a subset of the `FixedBitSet` and scale it according to the size of the bitset or something along these lines, if we think that it would be better than tracking this counter and dividing it by the number of values per doc.

> I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet

`SparseFixedBitSet` is the right choice for the sparse case when you need something that implements the `BitSet` API. Here we only need to produce a `DocIdSet` and buffering doc IDs into an array and sorting them using radix sort proved to be faster than accumulating doc IDs into a `SparseFixedBitSet`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org