You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/18 18:54:28 UTC

[GitHub] [lucene] msokolov opened a new pull request, #11781: Diversity check bugfix

msokolov opened a new pull request, #11781:
URL: https://github.com/apache/lucene/pull/11781

   I was looking into the changes in recall we have been observing in various test cases. Thanks @jtibshirani for pointing out we should not see any change at all! I found a couple of related bugs in the diversity-checking code that I had introduced when refactoring as part of splitting into float[]- and byte[]- oriented versions. I found some gaps in our test coverage and added tests. One of these demonstrates the bug; the other is just filling in another case.
   
   With this fix I was able to measure the identical recall comparing against test runs with 9.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974364724


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -555,6 +556,78 @@ public void testDiversity() throws IOException {
     assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
+  public void testDiversityFallback() throws IOException {
+    vectorEncoding = randomVectorEncoding();
+    similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+    // Some test cases can't be exercised in two dimensions;
+    // in particular if a new neighbor displaces an existing neighbor
+    // by being closer to the target, yet none of the existing neighbors is closer to the new vector
+    // than to the target -- ie they all remain diverse, so we simply drop the farthest one.
+    float[][] values = {
+      {0, 0, 0},
+      {0, 1, 0},
+      {0, 0, 2},
+      {1, 0, 0},
+      {0, 0.4f, 0}
+    };
+    MockVectorValues vectors = new MockVectorValues(values);
+    // First add nodes until everybody gets a full neighbor list
+    HnswGraphBuilder<?> builder =
+        HnswGraphBuilder.create(
+            vectors, vectorEncoding, similarityFunction, 1, 10, random().nextInt());
+    // node 0 is added by the builder constructor
+    // builder.addGraphNode(vectors.vectorValue(0));
+    RandomAccessVectorValues vectorsCopy = vectors.copy();
+    builder.addGraphNode(1, vectorsCopy);
+    builder.addGraphNode(2, vectorsCopy);
+    assertLevel0Neighbors(builder.hnsw, 0, 1, 2);
+    // 2 is closer to 0 than 1, so it is excluded as non-diverse
+    assertLevel0Neighbors(builder.hnsw, 1, 0);
+    // 1 is closer to 0 than 2, so it is excluded as non-diverse
+    assertLevel0Neighbors(builder.hnsw, 2, 0);
+
+    builder.addGraphNode(3, vectorsCopy);
+    // this is one case we are testing; 2 has been displaced by 3

Review Comment:
   nice test!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974364476


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -316,49 +316,49 @@ private boolean isDiverse(BytesRef candidate, NeighborArray neighbors, float sco
    */
   private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
     for (int i = neighbors.size() - 1; i > 0; i--) {
-      if (isWorstNonDiverse(i, neighbors, neighbors.score[i])) {
+      if (isWorstNonDiverse(i, neighbors)) {
         return i;
       }
     }
     return neighbors.size() - 1;
   }
 
-  private boolean isWorstNonDiverse(
-      int candidate, NeighborArray neighbors, float minAcceptedSimilarity) throws IOException {
+  private boolean isWorstNonDiverse(int candidateIndex, NeighborArray neighbors)
+      throws IOException {
+    int candidateNode = neighbors.node[candidateIndex];
     return switch (vectorEncoding) {
-      case BYTE -> isWorstNonDiverse(
-          candidate, vectors.binaryValue(candidate), neighbors, minAcceptedSimilarity);
+      case BYTE -> isWorstNonDiverse(candidateIndex, vectors.binaryValue(candidateNode), neighbors);
       case FLOAT32 -> isWorstNonDiverse(
-          candidate, vectors.vectorValue(candidate), neighbors, minAcceptedSimilarity);
+          candidateIndex, vectors.vectorValue(candidateNode), neighbors);
     };
   }
 
   private boolean isWorstNonDiverse(
-      int candidateIndex, float[] candidate, NeighborArray neighbors, float minAcceptedSimilarity)
-      throws IOException {
-    for (int i = candidateIndex - 1; i > -0; i--) {
+      int candidateIndex, float[] candidateVector, NeighborArray neighbors) throws IOException {
+    float minAcceptedSimilarity = neighbors.score[candidateIndex];
+    for (int i = candidateIndex - 1; i >= 0; i--) {
       float neighborSimilarity =
-          similarityFunction.compare(candidate, vectorsCopy.vectorValue(neighbors.node[i]));
-      // node i is too similar to node j given its score relative to the base node
+          similarityFunction.compare(candidateVector, vectorsCopy.vectorValue(neighbors.node[i]));
+      // candidate node is too similar to node i given its score relative to the base node
       if (neighborSimilarity >= minAcceptedSimilarity) {
-        return false;
+        return true;
       }
     }
-    return true;
+    return false;
   }
 
   private boolean isWorstNonDiverse(
-      int candidateIndex, BytesRef candidate, NeighborArray neighbors, float minAcceptedSimilarity)
-      throws IOException {
-    for (int i = candidateIndex - 1; i > -0; i--) {
+      int candidateIndex, BytesRef candidateVector, NeighborArray neighbors) throws IOException {

Review Comment:
   I am surprised that with this big change, we had only a small reduction in recall. I guess the reason could be that in our tests diversity check was really relevant only for small number of nodes; in majority of cases the algorithm just eliminated the most distant node.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov merged pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
msokolov merged PR #11781:
URL: https://github.com/apache/lucene/pull/11781


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mayya-sharipova commented on pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
mayya-sharipova commented on PR #11781:
URL: https://github.com/apache/lucene/pull/11781#issuecomment-1251139127

   @msokolov Thanks for tacking this.
   
   I ran ann benchmarks with this change, and happy to confirm that in my test recall with this PR is the same as in 9.3 branch, although QPS is lower, but we can investigate QPSs later.
   
   
   
   
   **glove-100-angular M:16 efConstruction:100**
   
   |             | 9.3 recall |  9.3 QPS | this PR recall | this PR QPS |
   | ----------- | ---------: | -------: | -------------: | ----------: |
   | n_cands=10  |      0.620 | 2745.933 |          0.620 |    1675.500 |
   | n_cands=20  |      0.680 | 2288.665 |          0.680 |    1512.744 |
   | n_cands=40  |      0.746 | 1770.243 |          0.746 |    1040.240 |
   | n_cands=80  |      0.809 | 1226.738 |          0.809 |     695.236 |
   | n_cands=120 |      0.843 |  948.908 |          0.843 |     525.914 |
   | n_cands=200 |      0.878 |  671.781 |          0.878 |     351.529 |
   | n_cands=400 |      0.918 |  392.265 |          0.918 |     207.854 |
   | n_cands=600 |      0.937 |  282.403 |          0.937 |     144.311 |
   | n_cands=800 |      0.949 |  214.620 |          0.949 |     116.875 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r974411586


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -316,49 +316,49 @@ private boolean isDiverse(BytesRef candidate, NeighborArray neighbors, float sco
    */
   private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
     for (int i = neighbors.size() - 1; i > 0; i--) {
-      if (isWorstNonDiverse(i, neighbors, neighbors.score[i])) {
+      if (isWorstNonDiverse(i, neighbors)) {
         return i;
       }
     }
     return neighbors.size() - 1;
   }
 
-  private boolean isWorstNonDiverse(
-      int candidate, NeighborArray neighbors, float minAcceptedSimilarity) throws IOException {
+  private boolean isWorstNonDiverse(int candidateIndex, NeighborArray neighbors)
+      throws IOException {
+    int candidateNode = neighbors.node[candidateIndex];
     return switch (vectorEncoding) {
-      case BYTE -> isWorstNonDiverse(
-          candidate, vectors.binaryValue(candidate), neighbors, minAcceptedSimilarity);
+      case BYTE -> isWorstNonDiverse(candidateIndex, vectors.binaryValue(candidateNode), neighbors);
       case FLOAT32 -> isWorstNonDiverse(
-          candidate, vectors.vectorValue(candidate), neighbors, minAcceptedSimilarity);
+          candidateIndex, vectors.vectorValue(candidateNode), neighbors);
     };
   }
 
   private boolean isWorstNonDiverse(
-      int candidateIndex, float[] candidate, NeighborArray neighbors, float minAcceptedSimilarity)
-      throws IOException {
-    for (int i = candidateIndex - 1; i > -0; i--) {
+      int candidateIndex, float[] candidateVector, NeighborArray neighbors) throws IOException {
+    float minAcceptedSimilarity = neighbors.score[candidateIndex];
+    for (int i = candidateIndex - 1; i >= 0; i--) {
       float neighborSimilarity =
-          similarityFunction.compare(candidate, vectorsCopy.vectorValue(neighbors.node[i]));
-      // node i is too similar to node j given its score relative to the base node
+          similarityFunction.compare(candidateVector, vectorsCopy.vectorValue(neighbors.node[i]));
+      // candidate node is too similar to node i given its score relative to the base node
       if (neighborSimilarity >= minAcceptedSimilarity) {
-        return false;
+        return true;
       }
     }
-    return true;
+    return false;
   }
 
   private boolean isWorstNonDiverse(
-      int candidateIndex, BytesRef candidate, NeighborArray neighbors, float minAcceptedSimilarity)
-      throws IOException {
-    for (int i = candidateIndex - 1; i > -0; i--) {
+      int candidateIndex, BytesRef candidateVector, NeighborArray neighbors) throws IOException {

Review Comment:
   I know - how did this garbage even work at all! :frowning_face: It's kind of astonishing how insensitive this whole process is to the diversity checking. Initially we didn't have it at all though (just always pick the closest neighbors), and things still kind of work. Then I had the wonky implementation that did not sort the neighbors while indexing, but did some best effort kind of thing, and still it mostly worked. So we need good tests here to ensure we are doing the right thing! Because bugs here can lead to small degradation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #11781:
URL: https://github.com/apache/lucene/pull/11781#discussion_r973782739


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/TestHnswGraph.java:
##########
@@ -555,6 +556,78 @@ public void testDiversity() throws IOException {
     assertLevel0Neighbors(builder.hnsw, 5, 1, 4);
   }
 
+  public void testDiversityFallback() throws IOException {
+    vectorEncoding = randomVectorEncoding();
+    similarityFunction = VectorSimilarityFunction.EUCLIDEAN;
+    // Some test cases can't be exercised in two dimensions;
+    // in particular if a new neighbor displaces an existing neighbor
+    // by being closer to the target, yet none of the existing neighbors is closer to the new vector
+    // than to the target -- ie they all remain diverse, so we simply drop the farthest one.
+    float[][] values = {
+      {0, 0, 0},
+      {0, 1, 0},
+      {0, 0, 2},
+      {1, 0, 0},
+      {0, 0.4f, 0}

Review Comment:
   hm, I guess this works for bytes too, but we should probably multiply everything here to make it non-fractional



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11781:
URL: https://github.com/apache/lucene/pull/11781#issuecomment-1252026867

   It looks like some recent test failures affecting the 9.4 branch are caused by this change, e.g. https://ci-builds.apache.org/job/Lucene/job/Lucene-NightlyTests-9.4/18/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11781: Diversity check bugfix

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11781:
URL: https://github.com/apache/lucene/pull/11781#issuecomment-1252034476

   Apologies, I just noticed that you opened https://github.com/apache/lucene/issues/11787.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org