You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/06/07 19:39:33 UTC

[GitHub] [lucene] msokolov opened a new pull request, #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

msokolov opened a new pull request, #947:
URL: https://github.com/apache/lucene/pull/947

   This is PR #3 for this feature. It is very close to the previous one, just "rebased" on top of the Lucene93 Codec. In this PR I moved the new vector utility methods as package private in util.hnsw so they would be easier to change in the future. I did not attempt any loop-unrolling optimizations. I have tried some incubating vector api implementations, but nothing is ready to share. I re-ran luceneutil tests, and will open a separate PR for adding support for this format to luceneutil. Results continue to look promising
   
     Task QPS   |   baseline |     StdDev | candidate  |      StdDev |                Pct diff | p-value
     --------------- |   -------: | ---------- | ---------: | ----------- | ----------------------- | ---------:
   PKLookup|      134.03|     (19.3%)| 135.28     |      (15.9%)|    0.9% ( -28% -   44%) |0.868
   LowTermVector|      637.48|      (9.0%)| 695.72     |      (10.7%)|    9.1% (  -9% -   31%) |0.003
   AndHighMedVector|      445.66|      (8.0%)| 559.35     |      (11.8%)|   25.5% (   5% -   49%) |0.000
   HighTermVector|      578.16|      (9.5%)| 737.93     |      (16.2%)|   27.6% (   1% -   59%) |0.000
   AndHighHighVector|      418.25|      (9.8%)| 555.38     |      (15.2%)|   32.8% (   7% -   64%) |0.000
   MedTermVector|      383.02|      (8.4%)| 550.12     |      (13.5%)|   43.6% (  20% -   71%) |0.000
   AndHighLowVector|      450.92|      (9.5%)| 713.77     |      (20.1%)|   58.3% (  26% -   97%) |0.000
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1156532534

   @msokolov Thanks for your work, this is a very exciting feature.  Really looking forward to it. Extra benefit – less disk space used for vector values.
   
   Overall, this PR looks very good to me. I just have several questions
   - During merging when writing a merged vector field it looks like we first expand vector values only to again to compress them later? Would be nice to avoid this.
   - Not sure if possible at this stage, but it would be nice if `HnswGraphBuilder` and `HnswGraphSearcher` were not aware of different calculations needed for different similarity functions, and refer all this calculations (dotProduct(BytesRef a..)) to `VectorSimilarityFunction`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1185833705

   I think the title of the PR is wrong? We shouldn't be quantizing anything. The user should be supplying a `byte[]` vector for 8-bit vectors. Floats should not be involved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r891642180


##########
lucene/backward-codecs/src/test/org/apache/lucene/backward_codecs/lucene92/Lucene92HnswVectorsWriter.java:
##########
@@ -129,6 +128,12 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
     try {
       // write the vector data to a temporary file
       DocsWithFieldSet docsWithField = writeVectorData(tempVectorData, vectors);
+      int byteSize;

Review Comment:
   oops some left-over change that came in while rebasing; I'll remove



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1150043099

   If folks have time to review, that would be great. The main thing to focus on I think is 
   
   1. how the HnswGraphSearcher/Writer implementation is forked into `float[]` and `BytesRef` using generics
   2. adding a VERSION to the index format so we can handle back-compat in a reasonable way
   3. the fact that some of the details of VectorSimilarityFunction are now directly handled in util.hnsw -- I didn't see a better way, but it blurs the abstraction boundary


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r897925136


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene93/OffHeapVectorValues.java:
##########
@@ -41,11 +42,11 @@ abstract class OffHeapVectorValues extends VectorValues
   protected final int byteSize;
   protected final float[] value;
 
-  OffHeapVectorValues(int dimension, int size, IndexInput slice) {
+  OffHeapVectorValues(int dimension, int size, IndexInput slice, int byteSize) {

Review Comment:
   For DOT_PRODUCT8 case for  `vectorValue(..)` functions should we use `slice.readBytes` instead of `slice.ReadFloats` and then convert them to floats? Otherwise, it looks to me we are reading wrong values.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1185649145

   I pushed an updated luceneutil PR adapting to these changes https://github.com/mikemccand/luceneutil/pull/181. Running that perf test I saw consistent gains (20-55% depending on the test case) as compared to the earlier test runs.
   
   I also noticed that the profiler shows the most expensive function during indexing is FixedBitSet.clear(), which makes me think we mioght want to use sparse bitsets for the "upper" layers of the graph which have many fewer nodes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r923800118


##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -213,4 +213,38 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param aOffset offset of the vector in a
+   * @param b bytes containing another vector, of the same dimension
+   * @param len the length (aka dimension) of the vectors
+   * @param bOffset offset of the vector in b
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) {
+    int total = 0;
+    for (int i = 0; i < len; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
+    }
+    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+    return (1 + total) / (float) (len * (1 << 15));

Review Comment:
   To make scores non-negative should we do instead:
   `total / (float) (len * (1 << 15))  + 1`?
   
   I am also wondering why in the comments we say `// divide by 2 * 2^14`, but in the calculation we use `1 << 15`? Should it be `1 << 14` instead?
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1180963745

   I'm looking to address various comments; just pushed a commit that makes the vector encoding explicit by adding a new enum and parameter "vectorEncoding", splitting this out from "similarityFunction".
   
   > During merging when writing a merged vector field it looks like we first expand vector values only to again to compress them later? Would be nice to avoid this.
   
   Oh good catch, @mayya-sharipova I will look into addressing this.
   
   > Not sure if possible at this stage, but it would be nice if HnswGraphBuilder and HnswGraphSearcher were not aware of different calculations needed for different similarity functions, and refer all this calculations (dotProduct(BytesRef a..)) to VectorSimilarityFunction
   
   I don't see how to do this efficiently (without many conversions from byte to float) and neatly (without code duplication in tricky algorithmic areas) and with complete API purity, so I sacrificed some purity. If you have any ideas how to do it better, I'm open to changing it though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1182694202

   OK, this last round of commits moves the new vector encoding parameter out of IndexableField and FieldInfo into Codec constructor and internally to the codec, in FieldEntry. It certainly has less visible surface area now. I also merged from main and resolved a bunch of conflicts with the scoring change. I think it is correct (all the unit tests pass), but it wasn't trivial and I think it would be worth running some integration/performance tests just to make sure all is still well.
   
   There's a little bit of code duplication in HnswGraphSearcher where we now have the logic for switching from approximate to exact knn in two places that I don't like. Maybe that can be factored better? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov closed pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov closed pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
URL: https://github.com/apache/lucene/pull/947


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r923800118


##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -213,4 +213,38 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param aOffset offset of the vector in a
+   * @param b bytes containing another vector, of the same dimension
+   * @param len the length (aka dimension) of the vectors
+   * @param bOffset offset of the vector in b
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) {
+    int total = 0;
+    for (int i = 0; i < len; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
+    }
+    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+    return (1 + total) / (float) (len * (1 << 15));

Review Comment:
   To make scores non-negative should we do instead:
   `total / (float) (len * (1 << 15))  + 1`?
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r923807979


##########
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsReader.java:
##########
@@ -89,4 +123,67 @@ public abstract TopDocs search(
   public KnnVectorsReader getMergeInstance() {
     return this;
   }
+
+  /** {@link #searchExhaustively} */
+  protected static TopDocs exhaustiveSearch(
+      VectorValues vectorValues,
+      DocIdSetIterator acceptDocs,
+      VectorSimilarityFunction similarityFunction,
+      float[] target,
+      int k)
+      throws IOException {
+    HitQueue queue = new HitQueue(k, true);
+    ScoreDoc topDoc = queue.top();
+    int doc;
+    while ((doc = acceptDocs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      int vectorDoc = vectorValues.advance(doc);
+      assert vectorDoc == doc;
+      float score = similarityFunction.compare(vectorValues.vectorValue(), target);
+      if (score >= topDoc.score) {
+        topDoc.score = score;
+        topDoc.doc = doc;
+        topDoc = queue.updateTop();
+      }
+    }
+    return topDocsFromHitQueue(queue, acceptDocs.cost());
+  }
+
+  /** {@link #searchExhaustively} */
+  protected static TopDocs exhaustiveSearch(
+      VectorValues vectorValues,
+      DocIdSetIterator acceptDocs,
+      VectorSimilarityFunction similarityFunction,

Review Comment:
   looks like `similarityFunction` is not necessary here, as we always use `dotProductScore`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r924545508


##########
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##########
@@ -213,4 +213,38 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param aOffset offset of the vector in a
+   * @param b bytes containing another vector, of the same dimension
+   * @param len the length (aka dimension) of the vectors
+   * @param bOffset offset of the vector in b
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, int aOffset, BytesRef b, int bOffset, int len) {
+    int total = 0;
+    for (int i = 0; i < len; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
+    }
+    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+    return (1 + total) / (float) (len * (1 << 15));

Review Comment:
   Should the maximum absolute value of product of 2 signed bytes be  `2^14 ` instead of `2 * 2^14`? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on code in PR #947:
URL: https://github.com/apache/lucene/pull/947#discussion_r923672692


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene93/Lucene93HnswVectorsReader.java:
##########
@@ -169,7 +173,12 @@ private void validateFieldEntry(FieldInfo info, FieldEntry fieldEntry) {
               + fieldEntry.dimension);
     }
 
-    long numBytes = (long) fieldEntry.size() * dimension * Float.BYTES;
+    long numBytes;
+    switch (fieldEntry.vectorEncoding) {
+      case BYTE -> numBytes = (long) fieldEntry.size() * dimension;
+      case FLOAT32 -> numBytes = (long) fieldEntry.size() * dimension * Float.BYTES;
+      default -> throw new AssertionError("unknown vector encoding " + fieldEntry.vectorEncoding);

Review Comment:
   Should we also update the error message that follows for `BYTE` case?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1180965832

   Also - if anybody has advice about how to rebase while maintaining this PR I'd be interested. Should I `git merge` from `main`??


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on pull request #947: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #947:
URL: https://github.com/apache/lucene/pull/947#issuecomment-1181017377

   >  During merging when writing a merged vector field it looks like we first expand vector values only to again to compress them later? Would be nice to avoid this.
   
   In fact after checking, I don't think we are doing this expand/compress step *even though getVectorValues() returns `ExpandingVectorValues` to the merger*. This is because the merger uses the `binaryValue()` call to write the vectors themselves, and that value is unchanged by EVV, and the graph is created by the (now) polymorphic hnsw utils that also call `binaryValue()` when they are dealing with a field encoded as bytes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org