You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/18 12:42:47 UTC

[GitHub] [lucene] iverase opened a new pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

iverase opened a new pull request #692:
URL: https://github.com/apache/lucene/pull/692


   The idea in this PR is the following:
   
   1) Extract the sparse structure to collect docIds from docIdSetBuilder in a class called Buffers so it can be used by the different implementations.
   2) Add a new DocIdSetBuilder implementation for points called `PointsDocIdSetBuilder` with the following API:
   
   ```
    /**
      * Utility class to efficiently add many docs in one go.
      *
      * @see PointsDocIdSetBuilder#grow
      */
     public abstract static class BulkAdder {
       public abstract void add(int doc);
   
       public void add(DocIdSetIterator iterator) throws IOException {
         int docID;
         while ((docID = iterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
           add(docID);
         }
       }
     }
   
     /**
      * Reserve space and return a {@link BulkAdder} object that can be used to visit up to {@code
      * numPoints} points.
      */
     public BulkAdder grow(long numPoints);
   ```
   
   3) DocIdSet builder API looks like that after the extraction of the points API:
   
   ```
   /**
      * Add the content of the provided {@link DocIdSetIterator} to this builder. NOTE: if you need to
      * build a {@link DocIdSet} out of a single {@link DocIdSetIterator}, you should rather use {@link
      * RoaringDocIdSet.Builder}.
      */
   public void add(DocIdSetIterator iter) throws IOException;
   
   
    /** Add a single document to this builder. */
     public void add(int doc);
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047752116


   I don't understand all this discussion. Looking at the cost of a DocIdSetIterator:
   
   ```
     /**
      * Returns the estimated cost of this {@link DocIdSetIterator}.
      *
      * <p>This is generally an upper bound of the number of documents this iterator might match, but
      * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
      */
     public abstract long cost();
   ```
   
   Why it is ok a long here?  I think the dance we are doing on the BKD reader when wee are visiting more that Integer.MAX_VALUE documents is wrong and should be fixed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047729980


   > In my opinion the API as it is today isn't bad. The only thing we might want to change is to make `DocIdSetBuilder#grow` take a long instead of an int.
   
   I've really tried, I think I have to just give up. Having a `grow(long)` on something with `DocIdSet` in its name is beyond bad, it is terrible.
   
   Please, please, please don't make this change to take a long.
   
   > > I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.
   > 
   > This doesn't have anything to do with the `long counter` that you looked at.
   > 
   > The point of `BulkAdder#add` is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than `Integer#MAX_VALUE` (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. `SortedSetDocValues#nextOrd` returns a long.
   
   Sure it does. I'm looking at the only code using the 64-bit value, and that's the `counter`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049948208


   prototype: https://github.com/apache/lucene/pull/709


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase edited a comment on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase edited a comment on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047752116


   I don't understand all this discussion. Looking at the cost of a DocIdSetIterator:
   
   ```
     /**
      * Returns the estimated cost of this {@link DocIdSetIterator}.
      *
      * <p>This is generally an upper bound of the number of documents this iterator might match, but
      * may be a rough heuristic, hardcoded value, or otherwise completely inaccurate.
      */
     public abstract long cost();
   ```
   
   Why it is ok a long here?  I think the dance we are doing on the BKD reader when wee are visiting more that Integer.MAX_VALUE ~documents~ points is wrong and should be fixed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1046065536

I'm still a bit confused about why we need to `grow(long)` on a bitset that can only hold `Integer.MAX_VALUE` elements. I've re-read the description of the JIRA several times this morning, but honestly I was confused about this before, too.

It seems the only purpose of the `long`, we're doing a lot of elaborate work just to estimate the cardinality that we'll ultimately pass down to the bitset? I don't see any other use of the `long` value other than this `counter`, and that's all we are doing with it. But surely using a `long` isn't helpful to this estimation, maybe we should just estimate it differently?

Sorry if my comment isn't very helpful, but I want to really understand the problem and why we need to bring 64 bits into this, currently it is very confusing. Perhaps we should remove this `counter` completely (temporarily), and use the other `BitDocIdSet` constructor. How simpler does the code get then?

It turns the problem around, into, how can we estimate the cost better. I don't think we should have to majorly reorganize the code just for this `counter`, it doesn't seem right at all. There are other ways we could estimate the cost rather than summing up the calls to `grow()`.

For example in the sparse/buffer case, wouldn't a much simpler estimation simply be the `length` of int array? I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047788256


   it's fine to do that since only 32 bits are needed.
   
   nothing uses 64-bits here, hence changing the api signature to a `long` is wrong.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049894634


   If this is literally all about "style" issue then let's be open and honest about that. I am fine with:
   ```
   /** sugar: to just make code look pretty, nothing else */
   public BulkAdder grow(long numDocs) {
     grow((int) Math.min(Integer.MAX_VALUE, numDocs));
   }
   ```
   
   But I think it is wrong to have constructors taking `Terms` and `PointValues` already: it is just more useless complexity and "leaky abstraction" from the terrible cost estimation.
   
   And I definitely think having two separate classes just for the cost estimation is way too much.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049869937

> @rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

I really think it will address the problem. I understand what is happening, but adding 32 more bits that merely get discarded also will not help anything. That's what is being discussed here.

It really is all about cost estimation, as that is the ONLY thing in this PR actually using the 32 extra bits. That's why i propose to simply use a different cost estimation instead. The current cost estimation explodes the complexity of this class: that's why we are tracking:
* `boolean multiValued`
* `double numValuesPerDoc`
* `long counter`

There's no need (from allocation perspective, which is all we should be concerned about here) to know about any numbers bigger than `Integer.MAX_VALUE`, if we get anywhere near numbers that big, we should be switching over to the `FixedBitSet` representation.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047748712


   > Having a grow(long) on something with DocIdSet in its name is beyond bad, it is terrible.
   
   Would it look better if we gave it a different name that doesn't suggest that it relates to the number of docs in the set, e.g. `prepareAdd` or something along these lines?
   
   > Please, please, please don't make this change to take a long.
   
   I have a preference for making it a long but I'm ok with keeping it an integer. The downside is that it pushes the problem to callers, which need to make sure that they never add more than `Integer.MAX_VALUE` documents with the same `BulkAdder`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047750276


   <img width="827" alt="Screen Shot 2022-02-22 at 7 29 00 AM" src="https://user-images.githubusercontent.com/504194/155133007-71ec1d81-a2bd-485d-b7e6-17a10cd78fdf.png">
   
   I've uploaded a screenshot here of how the only thing using 64-bits is this stupid `counter`. Guys, we really have to agree on this simple fact to proceed. It is a fact!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049905030


   To try to be more helpful, here's what i'd propose. I can try to hack up a draft PR later if we want, if it is helpful.
   
   DocIdSetBuilder, remove complex cost estimation:
   * remove `DocIdSetBuilder(int, Terms)` constructor
   * remove `DocIdSetBuilder(int, PointValues)` constructor
   * remove `DocIdSetBuilder.counter` member
   * remove `DocIdSetBuilder.multiValued` member
   * remove `DocIdSetBuilder.numValuesPerDoc` member
   
   DocIdSetBuilder: add sugar `grow(long)` for style purposes:
   ```
   /** sugar: to just make code look pretty, nothing else */
   public BulkAdder grow(long numDocs) {
     grow((int) Math.min(Integer.MAX_VALUE, numDocs));
   }
   ```
   
   FixedBitSet: implement `approximateCardinality()` and simply use it when estimating cost() here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase closed pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase closed pull request #692:
URL: https://github.com/apache/lucene/pull/692


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #692:
URL: https://github.com/apache/lucene/pull/692#discussion_r809966771



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -167,99 +85,44 @@ public void add(DocIdSetIterator iter) throws IOException {
       return;
     }

Review comment:
       I haven't change the implementation here but this looks buggy for me. If the bitSet is not null we don't update counter any more in this case? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1046560283


   I open https://github.com/apache/lucene/pull/698 where I proposed a different way to compute const that reflect that the internal counter of DocIdSetBuilder refers to documents instead of values.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047526553

In my opinion the API as it is today isn't bad. The only thing we might want to change is to make `DocIdSetBuilder#grow` take a long instead of an int.

Maybe it's a javadocs issue because `DocIdSetBuilder#grow` says that it returns "a `BulkAdder` object that can be used to add up to `numDocs` documents", which might suggest that `numDocs` is the number of unique documents contributed, when in fact this number is simply an upper bound of the number of times that you may call `BulkAdder#add` on the returned `BulkAdder` object.

> I'm still a bit confused about why we need to grow(long) on a bitset that can only hold Integer.MAX_VALUE elements.

This doesn't have anything to do with the `long counter` that you looked at.

The point of `BulkAdder#add` is to call it every time we find a matching (docID, value) pair, and the number of matching pairs may be larger than `Integer#MAX_VALUE` (e.g. a range over a multi-valued field that matches all docs but one), hence the long. This is the same reason why e.g. `SortedSetDocValues#nextOrd` returns a long.

> in the sparse/buffer case, wouldn't a much simpler estimation simply be the length of int array?

This is already the case today, see the `else` block in `DocIdSetBuilder#build`. The cost estimation logic only happens in the dense case when a `FixedBitSet` is used to hold the set of matching docs.

FWIW we could change the estimation logic to perform a popCount over a subset of the `FixedBitSet` and scale it according to the size of the bitset or something along these lines, if we think that it would be better than tracking this counter and dividing it by the number of values per doc.

> I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet

`SparseFixedBitSet` is the right choice for the sparse case when you need something that implements the `BitSet` API. Here we only need to produce a `DocIdSet` and buffering doc IDs into an array and sorting them using radix sort proved to be faster than accumulating doc IDs into a `SparseFixedBitSet`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1069345096


   won't happen


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049887302


   32 bits will need to be discarded anyway, the issue is where. 
   
   You either do it at the PointValues level by calling grow like:
   
   ```
   visitor.grow((int) Math.min(getDocCount(), pointTree.size());
   ```
   
   Or you  discarded in the DocIdSetBuilder and allow grow to be called just like:
   
   ```
   visitor.grow(pointTree.size());
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047752932


   Yeah, there seems to be some disagreement about what the code is actually doing. Probably because it is too confusing. Recommend (as i did before) to temporarily remove `counter` and cost estimation from here. Then you will see that 64 bits is not needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047789904


   Seriously, let's remove this `counter` and cost estimation. @jpountz tells me I am wrong, but you can plainly see from the code, this issue is all about that. Everything else is only using 32 bits.
   
   If we remove the silly `counter` and bad cost estimator, it will be clear that adding a `long` to this API is not needed: nothing needs the extra 32 bits, nothing uses the extra 32 bits!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang edited a comment on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

LuXugang edited a comment on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1046438848


   > I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).
   
   SparseFixedBitSet was truely used to gather docs in sparse case before [LUCENE-6645](https://issues.apache.org/jira/browse/LUCENE-6645).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] LuXugang commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

LuXugang commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1046438848


   > I'm also confused why we have this sorted array buffer case instead of using SparseFixedBitSet (which has approximateCardinality already and needs no such special grow-tracking).
   SparseFixedBitSet was truely used to gather docs in sparse case before [LUCENE-6645](https://issues.apache.org/jira/browse/LUCENE-6645).
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1047786510


   If you go a bit higher top in that class:
   
   <img width="765" alt="image" src="https://user-images.githubusercontent.com/29038686/155139791-fb87fedb-22a0-44a7-86a6-60b6af84f177.png">
   
   We are throwing 32 bits there now? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #692: LUCENE-10311: Different implementations of DocIdSetBuilder for points and terms

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #692:
URL: https://github.com/apache/lucene/pull/692#issuecomment-1049857125

@rmuir We can remove the cost estimation, but it will not address the problem. I'll try to explain the problem differently in case it helps.

DocIdSetBuilder takes doc IDs in random order with potential duplicates and creates a DocIdSet that can iterate over doc IDs in order without any duplicates. If you index a multi-valued field with points, a very large segment that has 2^30 docs might have 2^32 points matching a range query, which translates into 2^29 documents matching the query. So `DocIdBuilder#add` would be called 2^32 times and `DocIdSetBuilder#build` would result in a `DocIdSet` that has 2^29 documents. This `long` is measuring the number of calls to `DocIdSetBuilder#add`, hence the `long`.

The naming may be wrong here, as the `grow` name probably suggests a number of docs rather than a number of calls to `add`, similarly to how `ArrayUtil#grow` is about the number of items in the array - not the number of times you set an index. Hopefully renaming it to `prepareAdd(long numCallsToAdd)` or something along these lines would help clarify.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org