You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/09 17:29:41 UTC
[GitHub] [lucene] jpountz commented on a change in pull request #7: LUCENE-9820: Separate logic for reading the BKD index from logic to intersecting it

jpountz commented on a change in pull request #7:
URL: https://github.com/apache/lucene/pull/7#discussion_r745805971



##########
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##########
@@ -227,8 +228,59 @@ protected PointValues() {}
     CELL_CROSSES_QUERY
   };
 
+  /** Create a new {@link PointTree} to navigate the index */
+  public abstract PointTree getPointTree() throws IOException;
+
   /**
-   * We recurse the BKD tree, using a provided instance of this to guide the recursion.
+   * Basic operations to read the KD-tree.
+   *
+   * @lucene.experimental
+   */
+  public interface PointTree extends Cloneable {
+
+    /**
+     * Clone, the current node becomes the root of the new tree. The method should not be called
+     * after a successful call to {@link #moveToParent()}

Review comment:
       I still find it quite trappy that whether you can `clone()` doesn't depend on the node itself, but on how you got there. Sorry I think I asked several times already, but would it be an option to lift that restriction?

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java
##########
@@ -279,7 +437,91 @@ public int getNumLeaves() {
         numLeaves = rightMostLeafNode - leftMostLeafNode + 1 + leafNodeOffset;
       }
       assert numLeaves == getNumLeavesSlow(nodeID) : numLeaves + " " + getNumLeavesSlow(nodeID);
-      return numLeaves;
+      return rightMostLeafNode == this.rightMostLeafNode
+          ? (long) (numLeaves - 1) * config.maxPointsInLeafNode + lastLeafNodePointCount
+          : (long) numLeaves * config.maxPointsInLeafNode;
+    }
+
+    @Override
+    public void visitDocIDs(PointValues.IntersectVisitor visitor) throws IOException {
+      addAll(visitor, false);
+    }
+
+    public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
+      if (grown == false) {
+        final long size = size();
+        if (size <= Integer.MAX_VALUE) {
+          visitor.grow((int) size);
+          grown = true;
+        }

Review comment:
       maybe leave a comment about the fact that if `size()` is greater than MAX_VALUE then we'll grow when recursing?

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
##########
@@ -328,31 +288,88 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
           // Not deleted!
           docID = mappedDocID;
           System.arraycopy(
-              packedValues,
-              index * bkd.config.packedBytesLength,
-              state.scratchDataPackedValue,
+              mergeIntersectsVisitor.packedValues,
+              index * packedBytesLength,
+              packedValue,
               0,
-              bkd.config.packedBytesLength);
+              packedBytesLength);
+          return true;
+        }
+      }
+    }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
           return true;
         }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      if (docIDs.length - docsInBlock < count) {
+        docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
+        packedValues = ArrayUtil.grow(packedValues, (count + docsInBlock) * packedBytesLength);

Review comment:
       I think you need to do this to make sure that you can never end up in a case where `docIDs` is large-enough, but not `packedValues`.
   ```suggestion
           packedValues = ArrayUtil.growExact(packedValues, docIDs.length * packedBytesLength);
   ```

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDWriter.java
##########
@@ -328,31 +288,88 @@ public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
           // Not deleted!
           docID = mappedDocID;
           System.arraycopy(
-              packedValues,
-              index * bkd.config.packedBytesLength,
-              state.scratchDataPackedValue,
+              mergeIntersectsVisitor.packedValues,
+              index * packedBytesLength,
+              packedValue,
               0,
-              bkd.config.packedBytesLength);
+              packedBytesLength);
+          return true;
+        }
+      }
+    }
+
+    private boolean collectNextLeaf() throws IOException {
+      assert pointTree.moveToChild() == false;
+      mergeIntersectsVisitor.reset();
+      do {
+        if (pointTree.moveToSibling()) {
+          // move to first child of this node and collect docs
+          while (pointTree.moveToChild()) {}
+          pointTree.visitDocValues(mergeIntersectsVisitor);
           return true;
         }
+      } while (pointTree.moveToParent());
+      return false;
+    }
+  }
+
+  private static class MergeIntersectsVisitor implements IntersectVisitor {
+
+    int docsInBlock = 0;
+    byte[] packedValues;
+    int[] docIDs;
+    private final int packedBytesLength;
+
+    MergeIntersectsVisitor(int packedBytesLength) {
+      this.docIDs = new int[0];
+      this.packedValues = new byte[0];
+      this.packedBytesLength = packedBytesLength;
+    }
+
+    void reset() {
+      docsInBlock = 0;
+    }
+
+    @Override
+    public void grow(int count) {
+      if (docIDs.length - docsInBlock < count) {
+        docIDs = ArrayUtil.grow(docIDs, count + docsInBlock);
+        packedValues = ArrayUtil.grow(packedValues, (count + docsInBlock) * packedBytesLength);

Review comment:
       Do we need to care about not overflowing the max array length on `packedValues`?

##########
File path: lucene/core/src/java/org/apache/lucene/util/bkd/BKDReader.java
##########
@@ -279,7 +437,91 @@ public int getNumLeaves() {
         numLeaves = rightMostLeafNode - leftMostLeafNode + 1 + leafNodeOffset;
       }
       assert numLeaves == getNumLeavesSlow(nodeID) : numLeaves + " " + getNumLeavesSlow(nodeID);
-      return numLeaves;
+      return rightMostLeafNode == this.rightMostLeafNode
+          ? (long) (numLeaves - 1) * config.maxPointsInLeafNode + lastLeafNodePointCount
+          : (long) numLeaves * config.maxPointsInLeafNode;
+    }
+
+    @Override
+    public void visitDocIDs(PointValues.IntersectVisitor visitor) throws IOException {
+      addAll(visitor, false);
+    }
+
+    public void addAll(PointValues.IntersectVisitor visitor, boolean grown) throws IOException {
+      if (grown == false) {
+        final long size = size();
+        if (size <= Integer.MAX_VALUE) {
+          visitor.grow((int) size);
+          grown = true;
+        }

Review comment:
       I wonder about the implications of trying to `grow` as early as possible while the current implementation only grows at the leaf level. It looks like it could lead to loading the entire dataset in memory with our current merging logic? Maybe we should keep only growing at the leaf level for now and look into growing earlier in a follow-up?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org