You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/24 14:59:21 UTC

[GitHub] [lucene] rmuir opened a new pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

rmuir opened a new pull request #709:
URL: https://github.com/apache/lucene/pull/709


   Cost estimation drives the API complexity out of control, we don't need it. Hopefully i've cleared up all the API damage from this explosive leak.
   
   Instead, FixedBitSet.approximateCardinality() is used for cost estimation. TODO: let's optimize that!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050646004


   I remove the parameter `grown` from `addAll` in  [4c6b436](https://github.com/apache/lucene/pull/709/commits/4c6b436c7059742c80a7975a39b72494082c543c)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050884300


   What I want to make sure is this is covered in the javadocs and we are not relying on an implementation detail.
   
   #grow needs to be called with the number of times you are going to be calling BulkAdder#addDoc in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs and maybe avoid the word documents that is causing all the confusion here.
   
   We can add that if grow is called with Integer.MAX_VALUE, there are not limit to the calls or something along those lines. 
   
   wdyt?
   
   Finally, we might need to modify the `AssertingLeafReader` as it asserts that you call #grow with the number of points you are going to visit, which in this case is not true all the time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049999552


   I don't think the is necessary, we can always added to the IntersectVisitor instead. Maybe would be worthy to adjust how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing:
   
   https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562
   
   The same for  SimpleTextBKDReader#addAll


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049967927


   If we want to add the `grow(long)` sugar method that simply truncates to `Integer.MAX_VALUE` and clean up all the points callsites, or write a cool FixedBitSet.approximateCardinality, just feel free to push commits here. Otherwise I will get to these two things later and remove draft status on the PR.
   
   Adding the sugar method is easy, it is just work.
   Implementing the approximateCardinality requires some thought and prolly some benchmarking. I had in mind to just "sample" some "chunks" of the long[] and sum up `Long.bitCount` across the ranges. In upcoming JDK this method will get vectorized, let's take advantage of that, so then both `cardinality()` and `approximateCardinality` would get faster: https://github.com/openjdk/jdk/pull/6857
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1058571689


   @iverase @jpountz I "undrafted" the PR and added a commit with the `grow(long)` that just truncates-n-forwards. It seems like the best compromise based on discussion above. 
   
   I also made some minor tweaks to the javadoc to try to simplify the explanation about what the grow parameter means. Again, it is kind of academic when you think about it, values larger than `maxDoc >> 8` are not really needed by any code because we switch to the `FixedBitSet`. But the one-liner method doesn't bother me that much, i am just after keeping logic simple and abstractions minimal.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050699040


   Oh, but that might still be not correct. The buffers implementation does not grow with unique documents but with every call of BulkAdder#add because it does not discard duplicates. What I did only works if I assume that providing Integer.MAX_VALUE, the builder can add any number of calls to BulkAdder#add. Something is not totally right here as Buffers requires to know how many calls you are gonna be doing to BulkAdder#add and not the number of unique documents you are adding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r819498097



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -181,19 +144,27 @@ public void add(DocIdSetIterator iter) throws IOException {
   }
 
   /**
-   * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
-   * numDocs} documents.
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
+   */
+  public BulkAdder grow(long numAdds) {
+    // as an impl detail, we don't need to care about numbers bigger than int32,
+    // we switch over to FixedBitSet storage well before that threshold.
+    return grow((int) Math.min(Integer.MAX_VALUE, numAdds));
+  }
+
+  /**
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
    */
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(int numAdds) {
     if (bitSet == null) {
-      if ((long) totalAllocated + numDocs <= threshold) {
-        ensureBufferCapacity(numDocs);
+      if ((long) totalAllocated + numAdds <= threshold) {
+        ensureBufferCapacity(numAdds);

Review comment:
       > Are we not casting numAdds here back to a long again? I am fine with it.
   
   come on man, all i did was rename local variable `numDocs` to `numAdds` to make things clear. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814040808



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);

Review comment:
       I don't think it is difficult, it just requires a little work. I can get to it soon, seems like it should be fun. Ultimately I think it will give us better estimations than what we have today, without all the tangled APIs and abstraction leakage.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814039139



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);
       } else {
         Buffer concatenated = concat(buffers);
         LSBRadixSorter sorter = new LSBRadixSorter();
         sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
-        final int l;
-        if (multivalued) {
-          l = dedup(concatenated.array, concatenated.length);

Review comment:
       This optimization doesnt make sense to me. Buffers should only be used for tiny sets (they are very memory expensive).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814045946



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);
       } else {
         Buffer concatenated = concat(buffers);
         LSBRadixSorter sorter = new LSBRadixSorter();
         sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
-        final int l;
-        if (multivalued) {
-          l = dedup(concatenated.array, concatenated.length);

Review comment:
       Ok, I am convinced. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r813988648



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);
       } else {
         Buffer concatenated = concat(buffers);
         LSBRadixSorter sorter = new LSBRadixSorter();
         sorter.sort(PackedInts.bitsRequired(maxDoc - 1), concatenated.array, concatenated.length);
-        final int l;
-        if (multivalued) {
-          l = dedup(concatenated.array, concatenated.length);

Review comment:
       Do we really want to throw away this optimisation? we normally know if our data is single or multi-valued so it seems wasteful not to exploit it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r814047234



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);

Review comment:
       I like the idea of sampling, thanks




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r819367360



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -181,19 +144,27 @@ public void add(DocIdSetIterator iter) throws IOException {
   }
 
   /**
-   * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
-   * numDocs} documents.
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
+   */
+  public BulkAdder grow(long numAdds) {
+    // as an impl detail, we don't need to care about numbers bigger than int32,
+    // we switch over to FixedBitSet storage well before that threshold.
+    return grow((int) Math.min(Integer.MAX_VALUE, numAdds));
+  }
+
+  /**
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
    */
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(int numAdds) {
     if (bitSet == null) {
-      if ((long) totalAllocated + numDocs <= threshold) {
-        ensureBufferCapacity(numDocs);
+      if ((long) totalAllocated + numAdds <= threshold) {
+        ensureBufferCapacity(numAdds);

Review comment:
       For some reason I thought this method become private, I find weird to have two methods  #grow(long) and  #grow(int)  




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050188471


   > I don't think the grow(long) is necessary, we can always added to the IntersectVisitor instead. Maybe would be worthy to adjust how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing
   
   Sorry, I'm not so familiar with the code in question. Does it mean we can remove the `grown` parameter here and the split logic around it for the `addAll()` method? If so, that sounds great!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1052127011


   if we add `grow(long)` that simply truncates and forwards, then it encapsulates this within this class. The code stays simple and the caller doesn't need to know about it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049959940


   That change makes sense to me. FWIW my recollection from profiling DocIdSetBuilder is that the deduplication logic is cheap and most of the time is spent in `LSBRadixSorter#reorder` so it's ok to always deduplicate.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase edited a comment on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase edited a comment on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049999552


   I don't think the grow(long) is necessary, we can always added to the IntersectVisitor instead. Maybe would be worthy to adjust how we call grow() in BKDReader#addAll as it does not need the dance it is currently doing:
   
   https://github.com/apache/lucene/blob/8c67a3816b9060fa983b494886cd4f789be1d868/lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java#L562
   
   The same for  SimpleTextBKDReader#addAll


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050869655


   There's no way we're allowing more than `Integer.MAX_VALUE` calls going to this buffers thing. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050876486


   seriously, look at `threshold`. its `maxDoc >>> 7`. `maxDoc` is an int.
   
   when you call `grow(anywhere close to Integer.MAX_VALUE)` then buffers exits the stage permanently.
   
   64 bits are not needed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r819354785



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -181,19 +144,27 @@ public void add(DocIdSetIterator iter) throws IOException {
   }
 
   /**
-   * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
-   * numDocs} documents.
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
+   */
+  public BulkAdder grow(long numAdds) {
+    // as an impl detail, we don't need to care about numbers bigger than int32,
+    // we switch over to FixedBitSet storage well before that threshold.
+    return grow((int) Math.min(Integer.MAX_VALUE, numAdds));
+  }
+
+  /**
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
    */
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(int numAdds) {
     if (bitSet == null) {
-      if ((long) totalAllocated + numDocs <= threshold) {
-        ensureBufferCapacity(numDocs);
+      if ((long) totalAllocated + numAdds <= threshold) {
+        ensureBufferCapacity(numAdds);

Review comment:
       Are we not casting numAdds here back to a long again? I am fine with it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1059085772


   For the record this DocIdSetBuilder.Buffer has been so damaging to our code, insanely, I'm still here trying to calm down the explosion of horribleness caused by it. 
   
   I opened https://issues.apache.org/jira/browse/LUCENE-10443 as a second pass to this PR to try to really dig into the situation. Personally I am in favor of switching back to SparseFixedBitSet. 
   
   Sticking with bitsets help defend against so many bad decisions such as bringing `long` into these apis when its not needed. I actually don't mind if the performance drops a bit to use the `SparseFixedBitSet` instead of this horrible "buffer". We have to take care of complexity.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r819493935



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -181,19 +144,27 @@ public void add(DocIdSetIterator iter) throws IOException {
   }
 
   /**
-   * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code
-   * numDocs} documents.
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
+   */
+  public BulkAdder grow(long numAdds) {
+    // as an impl detail, we don't need to care about numbers bigger than int32,
+    // we switch over to FixedBitSet storage well before that threshold.
+    return grow((int) Math.min(Integer.MAX_VALUE, numAdds));
+  }
+
+  /**
+   * Reserve space and return a {@link BulkAdder} object that supports up to {@code numAdds}
+   * invocations of {@link BulkAdder#add(int)}.
    */
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(int numAdds) {
     if (bitSet == null) {
-      if ((long) totalAllocated + numDocs <= threshold) {
-        ensureBufferCapacity(numDocs);
+      if ((long) totalAllocated + numAdds <= threshold) {
+        ensureBufferCapacity(numAdds);

Review comment:
       i find it weird to have a `grow(long)` at all when no sizes above ~8.3 million matter at all. But I'm trying to compromise here. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1059725486


   I reverted the changes to the BKD tree as it is inconsistent with the current AssertingLeafReader implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on a change in pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on a change in pull request #709:
URL: https://github.com/apache/lucene/pull/709#discussion_r813994000



##########
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##########
@@ -266,20 +224,12 @@ private void upgradeToBitSet() {
   public DocIdSet build() {
     try {
       if (bitSet != null) {
-        assert counter >= 0;
-        final long cost = Math.round(counter / numValuesPerDoc);
-        return new BitDocIdSet(bitSet, cost);
+        return new BitDocIdSet(bitSet);

Review comment:
       we still ned to implement the method estimateCardinality which is the hard bit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1049948027


   Here's a first stab of what i proposed on https://github.com/apache/lucene/pull/692
   
   You can see how damaging the current cost() implementation is.
   
   As followup commits we can add the `grow(long)` sugar that simply truncates. And we should optimize `FixedBitSet.approximateCardinality()`. After doing that, we should look around and see if there is any other similar damage to our APIs related to the fact that FixedBitSet had a slow `approximateCardinality` and fix those, too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase commented on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1053335430


   +1 That would hide the implementation details from users.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] iverase edited a comment on pull request #709: LUCENE-10311: remove complex cost estimation and abstraction leakage around it

Posted by GitBox <gi...@apache.org>.

iverase edited a comment on pull request #709:
URL: https://github.com/apache/lucene/pull/709#issuecomment-1050884300


   What I want to make sure is this is covered in the javadocs and we are not relying on an implementation detail.
   
   #grow needs to be called with the number of times you are going to be calling BulkAdder#addDoc in order to make sure you don't overflow the sparse data structure. That should be added to the javadocs and maybe avoid the word documents that is causing all the confusion here.
   
   We can add that if grow is called with Integer.MAX_VALUE, there is not limit to the calls to BulkAdder#addDoc or something along those lines. 
   
   wdyt?
   
   Finally, we might need to modify the `AssertingLeafReader` as it asserts that you call #grow with the number of points you are going to visit, which in this case is not true all the time. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org