You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/27 20:09:37 UTC
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #926: VectorSimilarityFunction reverse removal

mayya-sharipova commented on code in PR #926:
URL: https://github.com/apache/lucene/pull/926#discussion_r883911186


##########
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java:
##########
@@ -47,63 +42,30 @@ public float convertToScore(float similarity) {
   DOT_PRODUCT {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return dotProduct(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + dotProduct(v1, v2)) / 2;
     }
   },
 
   /**
    * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
    * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You
    * should only use this function if you need to preserve the original vectors and cannot normalize
-   * them in advance.
+   * them in advance. The similarity score is normalised to assure it is positive.
    */
   COSINE {
     @Override
     public float compare(float[] v1, float[] v2) {
-      return cosine(v1, v2);
-    }
-
-    @Override
-    public float convertToScore(float similarity) {
-      return (1 + similarity) / 2;
+      return (1 + cosine(v1, v2)) / 2;
     }
   };
 
   /**
-   * If true, the scores associated with vector comparisons are nonnegative and in reverse order;
-   * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores
-   * represent more similar vectors, and scores may be negative or positive.
-   */
-  public final boolean reversed;
-
-  VectorSimilarityFunction(boolean reversed) {
-    this.reversed = reversed;
-  }
-
-  VectorSimilarityFunction() {
-    reversed = false;
-  }
-
-  /**
-   * Calculates a similarity score between the two vectors with a specified function.
+   * Calculates a similarity score between the two vectors with a specified function. Higher
+   * similarity scores correspond to closer vectors.
    *
    * @param v1 a vector
    * @param v2 another vector, of the same dimension
    * @return the value of the similarity function applied to the two vectors
    */
   public abstract float compare(float[] v1, float[] v2);

Review Comment:
   may be instead of `compare` we should call this function `score`?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
     for (int i = neighbors.size() - 1; i > 0; i--) {
       int cNode = neighbors.node[i];
       float[] cVector = vectorValues.vectorValue(cNode);
-      bound.set(neighbors.score[i]);
+      minAcceptedSimilarity = neighbors.score[i];

Review Comment:
   Similarly here, we don't need to use the global variable of `minAcceptedSimilarity`, it should be local for this function.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborQueue.java:
##########
@@ -56,9 +56,9 @@ long apply(long v) {
   // Whether the search stopped early because it reached the visited nodes limit
   private boolean incomplete;
 
-  public NeighborQueue(int initialSize, boolean reversed) {
+  public NeighborQueue(int initialSize, boolean descOrder) {

Review Comment:
   I am also in favour of calling this parameter "maxHeap", first it consistent with other variables in the class, and for me personally it is easier to think if I want max_heap or min_heap for the current case.



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -155,14 +155,14 @@ private NeighborQueue searchLevel(
 
     // A bound that holds the minimum similarity to the query vector that a candidate vector must
     // have to be considered.
-    BoundsChecker bound = BoundsChecker.create(similarityFunction.reversed);
+    float minAcceptedSimilarity = Float.NEGATIVE_INFINITY;
     if (results.size() >= topK) {
-      bound.set(results.topScore());
+      minAcceptedSimilarity = results.topNodeScore();
     }
     while (candidates.size() > 0 && results.incomplete() == false) {
       // get the best candidate (closest or best scoring)
-      float topCandidateScore = candidates.topScore();
-      if (bound.check(topCandidateScore)) {
+      float topCandidateSimilarity = candidates.topNodeScore();

Review Comment:
   Now with your changes, these are really scores, so I think it is probably worth to keep old names such as `topCandidateScore`



##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswGraphBuilder.java:
##########
@@ -51,7 +50,7 @@ public final class Lucene90HnswGraphBuilder {
   private final VectorSimilarityFunction similarityFunction;
   private final RandomAccessVectorValues vectorValues;
   private final SplittableRandom random;
-  private final BoundsChecker bound;
+  private float minAcceptedSimilarity = Float.POSITIVE_INFINITY;

Review Comment:
   I think our idea for old codecs (this needs to be confirmed by other members) is to never modify them.  If you delete a certain class., e.g. `BoundsChecker`, you should bring it to `lucene90` package as `Lucene90BoundChecker` and use it in Lucene90codecs.



##########
lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java:
##########
@@ -193,25 +208,40 @@ public void testAdvanceShallow() throws IOException {
       }
       try (IndexReader reader = DirectoryReader.open(d)) {
         IndexSearcher searcher = new IndexSearcher(reader);
-        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 3}, 3);
+        KnnVectorQuery query = new KnnVectorQuery("field", new float[] {0.5f, 1}, 3);

Review Comment:
   What's the reason  why query and assertions were modified?



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -245,11 +244,11 @@ private boolean diversityCheck(
       NeighborArray neighbors,
       RandomAccessVectorValues vectorValues)
       throws IOException {
-    bound.set(score);
+    minAcceptedSimilarity = score;

Review Comment:
   Instead of using  the global  variable of `minAcceptedSimilarity` variable, we can use use `score` parameter here, perhaps renaming this parameter to `candidateScore` or `candidateSimilarity`



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##########
@@ -264,13 +263,13 @@ private int findWorstNonDiverse(NeighborArray neighbors) throws IOException {
     for (int i = neighbors.size() - 1; i > 0; i--) {
       int cNode = neighbors.node[i];
       float[] cVector = vectorValues.vectorValue(cNode);
-      bound.set(neighbors.score[i]);
+      minAcceptedSimilarity = neighbors.score[i];
       // check the candidate against its better-scoring neighbors
       for (int j = i - 1; j >= 0; j--) {
-        float diversityCheck =
+        float buildVectorSimilarity =

Review Comment:
   instead of `buildVectorSimilarity` a better name would be `neighborSimilarity`. `buildVectors` is vectors are just vectors that allow us to get random access.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org