You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/04 21:56:14 UTC

[GitHub] [lucene] mayya-sharipova opened a new pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

mayya-sharipova opened a new pull request #649:
URL: https://github.com/apache/lucene/pull/649


   Better encoding of doc Ids in Lucene91HnswVectorsFormat
   
   Currently we write doc Ids of all documents that have vectors
   not very efficiently.
   This improve their encoding by:
   - for a case when all documents have vectors, we don't write
   document IDs.
   - for a case when only certain documents have vectors, we
   do delta encoding of doc Ids.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r805395018



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +214,19 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    meta.writeInt(count);
+    if (docIds == null) {
+      meta.writeShort((short) -1); // dense marker, each document has a vector value

Review comment:
       No particular reason, just copied it from another file. I've modified the code to use `writeByte` instead in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r802152096



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -140,7 +142,10 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
       // build the graph using the temporary vector data
       Lucene91HnswVectorsReader.OffHeapVectorValues offHeapVectors =
           new Lucene91HnswVectorsReader.OffHeapVectorValues(
-              vectors.dimension(), docIds, vectorDataInput);
+              vectors.dimension(),
+              docsWithField.cardinality(),
+              null, // graph construction doesn't need to know docIds

Review comment:
       Good comment. I've tried to address this in 6bf1aea543ddb3d19909a5bc3d9ccbb1b4fcf9e4.  For the sparse case, I fill this array `docIds`  in memory as was before.  Later, when we decouple the random vector values access and iterator, we can remove this part.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1033729822


   > I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array on disk?
   
   I think so. It's a bit like the terms index, the access pattern is random, but this data should be small enough that it would generally be in the FS cache.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1033730737


   Of course if we can have a less random access pattern (e.g. maybe by using exponential search, having skip lists like postings or jump tables like doc values) it would be even better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807998586



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       @jtibshirani Thanks for your feedback and comment.  What did you mean by "vectors were out of order"?   `VectorValues` `extends DocIdSetIterator`  and are expected to be accessed in the increasing doc ID order.
   
   Or did you mean `RandomAccessVectorValues`?  This class doesn't concern itself with doc Ids, so we should not worry docIds.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1038311005


   @jpountz @jtibshirani Thank you for your suggestions.  Are you ok if we keep this PR as it is now to optimize the dense case only,  and explore Adrien's suggestions for a sparse case in subsequent work?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807998586



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       @jtibshirani Thanks for your feedback and comment.  What did you mean by "vectors were out of order"?   `VectorValues` `extends DocIdSetIterator`  and are expected to be accessed in the increasing doc ID order.
   
   Or did you mean `RandomAccessVectorValues`?  I think this class doesn't concern itself with doc Ids.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r802150904



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -311,7 +313,9 @@ public void close() throws IOException {
     final int maxConn;
     final int numLevels;
     final int dimension;
+    private final int size;
     final int[] ordToDoc;
+    final IntUnaryOperator ordToDocOperator;

Review comment:
       addressed in 6bf1aea543ddb3d19909a5bc3d9ccbb1b4fcf9e4




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1035648183


   Additional motivation for this PR: it could help with performance of exact search (in https://github.com/apache/lucene/pull/656). When all docs have vectors, we can avoid a binary search in `VectorValues#advance`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r808504111



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       Sorry I read this too fast and wrote a confusing comment :) This check looks good to me.

##########
File path: lucene/CHANGES.txt
##########
@@ -204,6 +204,8 @@ Optimizations
 * LUCENE-10367: Optimize CoveringQuery for the case when the minimum number of
   matching clauses is a constant. (LuYunCheng via Adrien Grand)
 
+* LUCENE-10408 Better encoding of doc Ids in vectors (Mayya Sharipova, Julie Tibshirani, Adrien Grand)

Review comment:
       Thanks for including me! I'm also fine if you omit me when I'm a reviewer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r806160008



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +203,22 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    int count = docsWithField.cardinality();
+    meta.writeInt(count);
+    if (count == maxDoc) {
+      meta.writeByte((byte) -1);
+      ; // dense marker, each document has a vector value

Review comment:
       I think you accidentally added a newline before the comment.

##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       Would the checksum and doc IDs still match if the vectors were out of order? Maybe worth strengthening this check. 

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -424,38 +448,45 @@ public int docID() {
 
     @Override
     public int nextDoc() {
-      if (++ord >= size()) {
+      if (++ord >= size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
       }
       return doc;
     }
 
     @Override
     public int advance(int target) {
       assert docID() < target;
-      ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
-      if (ord < 0) {
-        ord = -(ord + 1);
+
+      if (ordToDoc == null) {
+        ord = target;
+      } else {
+        ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
+        if (ord < 0) {
+          ord = -(ord + 1);
+        }
       }
-      assert ord <= ordToDoc.length;
-      if (ord == ordToDoc.length) {
+
+      assert ord <= size;
+      if (ord == size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
+        ;

Review comment:
       stray semicolon

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -266,12 +268,12 @@ private Bits getAcceptOrds(Bits acceptDocs, FieldEntry fieldEntry) {
     return new Bits() {
       @Override
       public boolean get(int index) {
-        return acceptDocs.get(fieldEntry.ordToDoc[index]);
+        return acceptDocs.get(fieldEntry.ordToDoc(index));

Review comment:
       We could add a small optimization here where we check `fieldEntry.ordToDocs == null`, and if so just return `acceptDocs`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r806160008



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +203,22 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    int count = docsWithField.cardinality();
+    meta.writeInt(count);
+    if (count == maxDoc) {
+      meta.writeByte((byte) -1);
+      ; // dense marker, each document has a vector value

Review comment:
       I think you accidentally added a newline before the comment.

##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       Would the checksum and doc IDs still match if the vectors were out of order? Maybe worth strengthening this check. 

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -424,38 +448,45 @@ public int docID() {
 
     @Override
     public int nextDoc() {
-      if (++ord >= size()) {
+      if (++ord >= size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
       }
       return doc;
     }
 
     @Override
     public int advance(int target) {
       assert docID() < target;
-      ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
-      if (ord < 0) {
-        ord = -(ord + 1);
+
+      if (ordToDoc == null) {
+        ord = target;
+      } else {
+        ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
+        if (ord < 0) {
+          ord = -(ord + 1);
+        }
       }
-      assert ord <= ordToDoc.length;
-      if (ord == ordToDoc.length) {
+
+      assert ord <= size;
+      if (ord == size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
+        ;

Review comment:
       stray semicolon

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -266,12 +268,12 @@ private Bits getAcceptOrds(Bits acceptDocs, FieldEntry fieldEntry) {
     return new Bits() {
       @Override
       public boolean get(int index) {
-        return acceptDocs.get(fieldEntry.ordToDoc[index]);
+        return acceptDocs.get(fieldEntry.ordToDoc(index));

Review comment:
       We could add a small optimization here where we check `fieldEntry.ordToDocs == null`, and if so just return `acceptDocs`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807992728



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -424,38 +448,45 @@ public int docID() {
 
     @Override
     public int nextDoc() {
-      if (++ord >= size()) {
+      if (++ord >= size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
       }
       return doc;
     }
 
     @Override
     public int advance(int target) {
       assert docID() < target;
-      ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
-      if (ord < 0) {
-        ord = -(ord + 1);
+
+      if (ordToDoc == null) {
+        ord = target;
+      } else {
+        ord = Arrays.binarySearch(ordToDoc, ord + 1, ordToDoc.length, target);
+        if (ord < 0) {
+          ord = -(ord + 1);
+        }
       }
-      assert ord <= ordToDoc.length;
-      if (ord == ordToDoc.length) {
+
+      assert ord <= size;
+      if (ord == size) {
         doc = NO_MORE_DOCS;
       } else {
-        doc = ordToDoc[ord];
+        doc = ordToDocOperator.applyAsInt(ord);
+        ;

Review comment:
       Addressed in 47042f2




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1033175828


   @jtibshirani @jpountz Thanks for your feedback. I've tried to address in 6bf1aea543ddb3d19909a5bc3d9ccbb1b4fcf9e4.  I've decided to focus this PR only on optimizing the dense case and keeping the sparse case as was before – uncompressed way.
   
   > I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.
   
   @jpountz  Thank you for the suggestion, Adrien.  I've put this as TODO in the code to explore.  I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array  on disk?  Do we have a precedent for such a case?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807998586



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       @jtibshirani Thanks for your feedback and comment.  What did you mean by "vectors were out of order"?   `VectorValues` `extends DocIdSetIterator`  and are expected to be accessed in the increasing doc ID order.
   
   Or did you mean `RandomAccessVectorValues`?  This class doesn't concern itself with doc Ids, so we should not worry about docIds in this case.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807992105



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +203,22 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    int count = docsWithField.cardinality();
+    meta.writeInt(count);
+    if (count == maxDoc) {
+      meta.writeByte((byte) -1);
+      ; // dense marker, each document has a vector value

Review comment:
       Addressed in 47042f2




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jpountz commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1031754530


   Optimizing for the case when all docs have a value makes sense to me.
   
   > for a case when only certain documents have vectors, we do delta encoding of doc Ids.
   
   In the past we rejected changes that would consist of having the data written in a compressed fashion on disk but still uncompressed in memory.
   
   I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r805395018



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +214,19 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    meta.writeInt(count);
+    if (docIds == null) {
+      meta.writeShort((short) -1); // dense marker, each document has a vector value

Review comment:
       No particular reason, just copied it from another file. I've modified the code to use `writeByte` instead in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -372,7 +393,9 @@ int size() {
       implements RandomAccessVectorValues, RandomAccessVectorValuesProducer {
 
     final int dimension;
+    final int size;

Review comment:
       Good comment, addressed in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -138,9 +140,20 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
 
       long vectorIndexOffset = vectorIndex.getFilePointer();
       // build the graph using the temporary vector data
+      int count = docsWithField.cardinality();
+      int[] docIds = null;
+      if (count < maxDoc) {

Review comment:
       Makes sense to be, I've reverted in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d to use the previous approach in  with added comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r799925962



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -140,7 +142,10 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
       // build the graph using the temporary vector data
       Lucene91HnswVectorsReader.OffHeapVectorValues offHeapVectors =
           new Lucene91HnswVectorsReader.OffHeapVectorValues(
-              vectors.dimension(), docIds, vectorDataInput);
+              vectors.dimension(),
+              docsWithField.cardinality(),
+              null, // graph construction doesn't need to know docIds

Review comment:
       hmm this feels a little fragile. I see how it's hard to improve though, since the random access and iterator functionality is tied together in a single class.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -311,7 +313,9 @@ public void close() throws IOException {
     final int maxConn;
     final int numLevels;
     final int dimension;
+    private final int size;
     final int[] ordToDoc;
+    final IntUnaryOperator ordToDocOperator;

Review comment:
       We could make this `private` since it doesn't need to be accessed externally.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r805395172



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -138,9 +140,20 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
 
       long vectorIndexOffset = vectorIndex.getFilePointer();
       // build the graph using the temporary vector data
+      int count = docsWithField.cardinality();
+      int[] docIds = null;
+      if (count < maxDoc) {

Review comment:
       Makes sense to be, I've reverted in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d to use the previous approach in  with added comment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807998586



##########
File path: lucene/test-framework/src/java/org/apache/lucene/tests/index/BaseKnnVectorsFormatTestCase.java
##########
@@ -1018,4 +1020,57 @@ public void testAdvance() throws Exception {
       }
     }
   }
+
+  public void testVectorValuesReportCorrectDocs() throws Exception {
+    final int numDocs = atLeast(1000);
+    final int dim = random().nextInt(20) + 1;
+    final VectorSimilarityFunction similarityFunction =
+        VectorSimilarityFunction.values()[
+            random().nextInt(VectorSimilarityFunction.values().length)];
+
+    float fieldValuesCheckSum = 0f;
+    int fieldDocCount = 0;
+    long fieldSumDocIDs = 0;
+
+    try (Directory dir = newDirectory();
+        RandomIndexWriter w = new RandomIndexWriter(random(), dir, newIndexWriterConfig())) {
+      for (int i = 0; i < numDocs; i++) {
+        Document doc = new Document();
+        int docID = random().nextInt(numDocs);
+        doc.add(new StoredField("id", docID));
+        if (random().nextInt(4) == 3) {
+          float[] vector = randomVector(dim);
+          doc.add(new KnnVectorField("knn_vector", vector, similarityFunction));
+          fieldValuesCheckSum += vector[0];
+          fieldDocCount++;
+          fieldSumDocIDs += docID;
+        }
+        w.addDocument(doc);
+      }
+
+      if (random().nextBoolean()) {
+        w.forceMerge(1);
+      }
+
+      try (IndexReader r = w.getReader()) {
+        float checksum = 0;

Review comment:
       @jtibshirani Thanks for your feedback and comment.  What did you mean by "vectors were out of order"?   `VectorValues` `extends DocIdSetIterator`  and are expected to be accessed in the increasing doc IDs order.
   
   Or did you mean `RandomAccessVectorValues`?  I think this class doesn't concern itself with doc Ids.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova merged pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova merged pull request #649:
URL: https://github.com/apache/lucene/pull/649


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r807993536



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -266,12 +268,12 @@ private Bits getAcceptOrds(Bits acceptDocs, FieldEntry fieldEntry) {
     return new Bits() {
       @Override
       public boolean get(int index) {
-        return acceptDocs.get(fieldEntry.ordToDoc[index]);
+        return acceptDocs.get(fieldEntry.ordToDoc(index));

Review comment:
       Great comment, addressed in 47042f2




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova edited a comment on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova edited a comment on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1033175828


   @jtibshirani @jpountz Thanks for your feedback. I've tried to address in 6bf1aea543ddb3d19909a5bc3d9ccbb1b4fcf9e4.  I've decided to focus this PR only on optimizing the dense case and keeping the sparse case as was before – uncompressed way.
   
   > I wonder if it would be a better trade-off to keep ints uncompressed, but read them from disk directly instead of loading giant arrays in memory? Or possibly switch to something like DirectMonotonicReader if it doesn't slow down searches.
   
   @jpountz  Thank you for the suggestion, Adrien.  I've put this as TODO in the code to explore.  I am also wondering since we use binarySearch on docIds array, would it still be acceptable to have this array  on disk?  Do we have a precedent for such use case for reading from disk?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r804189512



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -138,9 +140,20 @@ public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
 
       long vectorIndexOffset = vectorIndex.getFilePointer();
       // build the graph using the temporary vector data
+      int count = docsWithField.cardinality();
+      int[] docIds = null;
+      if (count < maxDoc) {

Review comment:
       Although it was a bit fragile, I preferred the previous approach of passing `null` with a clear comment. Now it seems like we're doing (potentially significant?) extra work that will not be used.

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java
##########
@@ -206,14 +214,19 @@ private void writeMeta(
     meta.writeVLong(vectorIndexOffset);
     meta.writeVLong(vectorIndexLength);
     meta.writeInt(field.getVectorDimension());
-    meta.writeInt(docIds.length);
-    for (int docId : docIds) {
-      // TODO: delta-encode, or write as bitset
-      meta.writeVInt(docId);
+
+    // write docIDs
+    meta.writeInt(count);
+    if (docIds == null) {
+      meta.writeShort((short) -1); // dense marker, each document has a vector value

Review comment:
       Any reason not to use `writeByte` here?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -372,7 +393,9 @@ int size() {
       implements RandomAccessVectorValues, RandomAccessVectorValuesProducer {
 
     final int dimension;
+    final int size;

Review comment:
       Small comment, maybe we can make all of these variables (including the new ones) private.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on a change in pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on a change in pull request #649:
URL: https://github.com/apache/lucene/pull/649#discussion_r805395073



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java
##########
@@ -372,7 +393,9 @@ int size() {
       implements RandomAccessVectorValues, RandomAccessVectorValuesProducer {
 
     final int dimension;
+    final int size;

Review comment:
       Good comment, addressed in ca9bfb25de247fdcf09a2039ef4b25f6c87d5c6d




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mayya-sharipova commented on pull request #649: LUCENE-10408 Better encoding of doc Ids in vectors

Posted by GitBox <gi...@apache.org>.

mayya-sharipova commented on pull request #649:
URL: https://github.com/apache/lucene/pull/649#issuecomment-1038311005


   @jpountz @jtibshirani Thank you for your suggestions.  Are you ok if we keep this PR as it is now to optimize the dense case only,  and explore Adrien's suggestions for a sparse case in subsequent work?  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org