You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/27 13:47:36 UTC

[GitHub] [lucene] msokolov commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

msokolov commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883626055


##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return dotProduct(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + dotProduct(v1, v2)) / 2;
     }
   },
 
   /**
    * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
    * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You
    * should only use this function if you need to preserve the original vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is positive.
    */
   COSINE {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return cosine(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + cosine(v1, v2)) / 2;
     }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-    this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-    reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified function.
+   * Calculates a similarity score between the two vectors with a specified function. Highest
+   * similarity score, means always closer vectors.

Review Comment:
   grammar nit: "Higher similarity scores correspond to closer vectors" seems more idiomatic to me



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##########
@@ -56,9 +56,9 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes limit
   private boolean incomplete;
 
-  public NeighborQueue(int initialSize, boolean reversed) {
+  public NeighborQueue(int initialSize, boolean descOrder) {

Review Comment:
   for consistency, perhaps call the parameter `maxHeap`?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##########
@@ -109,12 +117,15 @@ public int[] nodes() {
 
   /** Returns the top element's node id. */
   public int topNode() {
-    return (int) order.apply(heap.top());
+    return decodeNodeId(heap.top());
   }
 
-  /** Returns the top element's node score. */
-  public float topScore() {
-    return NumericUtils.sortableIntToFloat((int) (order.apply(heap.top()) >> 32));
+  /**
+   * Returns the top element's node score. For the min heap this is the minimum score. For the max

Review Comment:
   Is there a reason for renaming this method? It seems unneccessary. We could just change the javadoc to say "returns the top element's score"?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
     for (int i = neighbors.size() - 1; i > 0; i--) {
       int cNode = neighbors.node[i];
       float[] cVector = vectorValues.vectorValue(cNode);
-      bound.set(neighbors.score[i]);
+      minAcceptedSimilarity = neighbors.score[i];
       // check the candidate against its better-scoring neighbors
       for (int j = i - 1; j >= 0; j--) {
-        float diversityCheck =
+        float buildVectoreSimilarity =

Review Comment:
   typo - "Vectore" should read "Vector"



##########
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##########
@@ -193,25 +204,36 @@ public void testAdvanceShallow() throws IOException {
       }
       try (IndexReader reader = DirectoryReader.open(d)) {
         IndexSearcher searcher = new IndexSearcher(reader);
-        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
+        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3);
         Query dasq = query.rewrite(reader);
         Scorer scorer =
             dasq.createWeight(searcher, ScoreMode.COMPLETE, 1).scorer(reader.leaves().get(0));
         // before advancing the iterator
-        assertEquals(1, scorer.advanceShallow(0));
+        assertEquals(0, scorer.advanceShallow(0));
         assertEquals(1, scorer.advanceShallow(1));
         assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
 
         // after advancing the iterator
         scorer.iterator().advance(2);
         assertEquals(2, scorer.advanceShallow(0));
+        assertEquals(2, scorer.advanceShallow(1));
         assertEquals(2, scorer.advanceShallow(2));
-        assertEquals(3, scorer.advanceShallow(3));
         assertEquals(NO_MORE_DOCS, scorer.advanceShallow(10));
       }
     }
   }
 
+  /**
+   * Query = (0.5, 1)
+   * Doc0 = (0, 0) 1 / (l2distance + 1) from query = 0.444
+   * Doc1 = (1, 1) 1 / (l2distance + 1) from query = 0.8
+   * Doc2 = (2, 2) 1 / (l2distance + 1) from query = 0.235
+   * Doc3 = (3, 3) 1 / (l2distance + 1) from query = 0.089
+   * Doc4 = (4, 4) 1 / (l2distance + 1) from query = 0.045
+   * 
+   * The expected TOP 3 = [Doc1, Doc0, Doc2]
+   * @throws IOException
+   */

Review Comment:
   I think having multiple documents with the same distance is a feature of the test since it requires us to ensure that tiebreaking by docid is working properly. Was there a reason why you needed to change it?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org