You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/20 13:57:16 UTC
[GitHub] [lucene] jpountz commented on a diff in pull request #687: LUCENE-10425：speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

jpountz commented on code in PR #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r975393095


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##########
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;
+        byte[] queryUpperPoint = LongPoint.pack(upperValue).bytes;
+        if (matchNone(points, queryLowerPoint, queryUpperPoint)) {
+          return new BoundedDocIdSetIterator(-1, -1, delegate);
+        }
+        if (matchAll(points, queryLowerPoint, queryUpperPoint)) {
+          return new BoundedDocIdSetIterator(0, points.getDocCount(), delegate);

Review Comment:
   This doesn't look correct for me on sparse fields, where there might be matching doc IDs between `points.getDocCount()` and `reader.maxDoc()`?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##########
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;

Review Comment:
   I think it would be a bit easier to read for me if we did something like that instead
   
   ```java
   int cmp = comparator.compare(testPackedValue, 0, packedValue, 0);
   return cmp > 0 || (cmp == 0 && allowEqual);
   ```



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##########
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;
+        byte[] queryUpperPoint = LongPoint.pack(upperValue).bytes;
+        if (matchNone(points, queryLowerPoint, queryUpperPoint)) {
+          return new BoundedDocIdSetIterator(-1, -1, delegate);

Review Comment:
   Nit: I would prefer passing 0 and 0 rather than -1 and -1 since -1 is not a valid doc ID.



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##########
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;

Review Comment:
   Maybe disable this optimization if the number of bytes per dimensions is not 8, so that it doesn't fail on integer fields?



##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##########
@@ -214,12 +220,166 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          if (allowEqual) {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) < 0) {
+              return false;
+            }
+          } else {
+            if (comparator.compare(testPackedValue, 0, packedValue, 0) <= 0) {
+              return false;
+            }
+          }
+          return true;
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+        if (points.getDocCount() == context.reader().maxDoc()) {
+          delegate = null;
+        }
+
+        byte[] queryLowerPoint = LongPoint.pack(lowerValue).bytes;
+        byte[] queryUpperPoint = LongPoint.pack(upperValue).bytes;
+        if (matchNone(points, queryLowerPoint, queryUpperPoint)) {
+          return new BoundedDocIdSetIterator(-1, -1, delegate);
+        }
+        if (matchAll(points, queryLowerPoint, queryUpperPoint)) {
+          return new BoundedDocIdSetIterator(0, points.getDocCount(), delegate);
+        }
+        if (points.getNumDimensions() != 1) {
+          return null;
+        }
+        // >=queryLowerPoint
+        int minDocId = nextDoc(points, queryLowerPoint, true);
+        if (minDocId == -1) {
+          return new BoundedDocIdSetIterator(-1, -1, delegate);
+        }
+        // >queryUpperPoint,
+        int maxDocId = nextDoc(points, queryUpperPoint, false);
+        if (maxDocId == -1) {
+          maxDocId = context.reader().numDocs() - 1;
+        } else {
+          // return maxDocId the smallest doc id whose value >queryUpperPoint, so maxDocId-1 is
+          // biggest doc id whose value
+          // <=queryUpperPoint
+          maxDocId = maxDocId - 1;
+        }
+        return new BoundedDocIdSetIterator(minDocId, maxDocId + 1, delegate);

Review Comment:
   could we remove the `+ 1` here as well as the two `- 1` in the above `if/else` blocks?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org