You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/23 10:09:54 UTC
[GitHub] [lucene] jpountz commented on a change in pull request #700: LUCENE-10382: Ensure kNN filtering works with other codecs

jpountz commented on a change in pull request #700:
URL: https://github.com/apache/lucene/pull/700#discussion_r812722194



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -821,6 +822,64 @@ public void testRandom() throws Exception {
     }
   }
 
+  public void testSearchWithVisitedLimit() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();
+    String fieldName = "field";
+    try (Directory dir = newDirectory();
+        IndexWriter iw = new IndexWriter(dir, iwc)) {
+      int numDoc = atLeast(300);
+      int dimension = atLeast(10);
+      for (int i = 0; i < numDoc; i++) {
+        int id = random().nextInt(numDoc);

Review comment:
       maybe just use `i` as the `id`, it makes things a bit confusing to me that some docs might have duplicate ids, and that some of the delete-by-term calls below might not delete any document?

##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -821,6 +822,64 @@ public void testRandom() throws Exception {
     }
   }
 
+  public void testSearchWithVisitedLimit() throws Exception {
+    IndexWriterConfig iwc = newIndexWriterConfig();
+    String fieldName = "field";
+    try (Directory dir = newDirectory();
+        IndexWriter iw = new IndexWriter(dir, iwc)) {
+      int numDoc = atLeast(300);
+      int dimension = atLeast(10);
+      for (int i = 0; i < numDoc; i++) {
+        int id = random().nextInt(numDoc);
+        float[] value;
+        if (random().nextInt(7) != 3) {
+          // usually index a vector value for a doc
+          value = randomVector(dimension);
+        } else {
+          value = null;
+        }
+        add(iw, fieldName, id, value, VectorSimilarityFunction.EUCLIDEAN);
+      }
+      iw.forceMerge(1);
+
+      // randomly delete some documents
+      for (int i = 0; i < 30; i++) {
+        int idToDelete = random().nextInt(numDoc);
+        iw.deleteDocuments(new Term("id", Integer.toString(idToDelete)));
+      }
+
+      try (IndexReader reader = DirectoryReader.open(iw)) {
+        for (LeafReaderContext ctx : reader.leaves()) {
+          Bits liveDocs = ctx.reader().getLiveDocs();
+          VectorValues vectorValues = ctx.reader().getVectorValues(fieldName);
+          if (vectorValues == null) {
+            continue;
+          }
+
+          // check the limit is hit when it's very small
+          int k = 5 + random().nextInt(45);
+          int visitedLimit = k + random().nextInt(5);
+          TopDocs results =
+              ctx.reader()
+                  .searchNearestVectors(
+                      fieldName, randomVector(dimension), k, liveDocs, visitedLimit);
+          assertEquals(TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO, results.totalHits.relation);

Review comment:
       Would this be actually guaranteed for every KNN vectors format? It feels hard to identify nearest neighbors while visitedLimit is so close to k, but maybe some edge cases like completely disconnected graphs could trigger rare test failures?

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
##########
@@ -186,6 +188,7 @@ private NeighborQueue searchLevel(
             }
           }
         }
+        numVisited++;

Review comment:
       nit: I liked having `numVisited` next to `similarityFunction.compare` better since my mental model is that the point of `numVisited` is to compute the number of similarity computations that are performed to find the top-k nearest neighbors to the query vector.

##########
File path: lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java
##########
@@ -155,18 +159,13 @@ private NeighborQueue searchLevel(
     if (results.size() >= topK) {
       bound.set(results.topScore());
     }
-    while (candidates.size() > 0) {
+    while (candidates.size() > 0 && results.incomplete() == false) {

Review comment:
       Maybe extract this to an outer `if` statement so that we only check it before entering the while loop. Once we are in the while loop, `results.incomplete()` is guaranteed to be `false` since we always break the loop after calling `markIncomplete()`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org