You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/23 20:01:28 UTC

[GitHub] [lucene-solr] msokolov opened a new pull request #2239: LUCENE-9695: don't merge deleted vectors

msokolov opened a new pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov merged pull request #2239: LUCENE-9695: don't merge deleted vectors

Posted by GitBox <gi...@apache.org>.
msokolov merged pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #2239: LUCENE-9695: don't merge deleted vectors

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239#discussion_r563309172



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -153,13 +153,12 @@ public int nextDoc() throws IOException {
     private final DocIDMerger<VectorValuesSub> docIdMerger;
     private final int[] ordBase;
     private final int cost;
-    private final int size;
+    private int size;
 
     private int docId;
     private VectorValuesSub current;
-    // For each doc with a vector, record its ord in the segments being merged. This enables random
-    // access into the
-    // unmerged segments using the ords from the merged segment.
+    /* For each doc with a vector, record its ord in the segments being merged. This enables random access into the unmerged segments using the ords from the merged segment.
+     */

Review comment:
       Hmmm I thought spotless would wrap this line, but it doesn't seem to complain about it

##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
##########
@@ -578,6 +572,8 @@ private int createIndex(Path docsPath, Path indexPath) throws IOException {
     IndexWriterConfig iwc = new IndexWriterConfig().setOpenMode(IndexWriterConfig.OpenMode.CREATE);
     // iwc.setMergePolicy(NoMergePolicy.INSTANCE);
     iwc.setRAMBufferSizeMB(1994d);
+    iwc.setMaxBufferedDocs(10000);

Review comment:
       Oh I did not mean to include this change here. Probably we will want to have some command line parameter to control this, but for now having the default be to index a large segment is probably better

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -194,6 +197,7 @@ public int nextDoc() throws IOException {
       current = docIdMerger.next();
       if (current == null) {
         docId = NO_MORE_DOCS;
+        size = ord;

Review comment:
       Thanks, yes I did.

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -124,21 +126,84 @@ public void testMerge() throws Exception {
     }
   }
 
-  private void dumpGraph(KnnGraphValues values, int size) throws IOException {
+  /**
+   * Verify that we get the *same* graph by indexing one segment as we do by indexing two segments
+   * and merging.
+   */
+  public void testMergeProducesSameGraph() throws Exception {

Review comment:
       thanks, yes this flushed out the two problems I saw, so I'm pretty confident they are fixed now, after running this a few 100 times. I had also wanted to add a test asserting that KNN search precision remains above some threshold, but sadly with random vectors, it would always eventually fail, even though mostly it would succeed, so not a very useful unit test and I removed it. Probably we can add something to luceneutil

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -748,29 +751,107 @@ public void testRandom() throws Exception {
             assertEquals(dimension, v.length);
             String idString = ctx.reader().document(docId).getField("id").stringValue();
             int id = Integer.parseInt(idString);
-            assertArrayEquals(idString, values[id], v, 0);
-            ++valueCount;
+            if (ctx.reader().getLiveDocs() == null || ctx.reader().getLiveDocs().get(docId)) {
+              assertArrayEquals(idString, values[id], v, 0);
+              ++valueCount;
+            } else {
+              assertNull(values[id]);
+            }
           }
         }
         assertEquals(numValues, valueCount);
-        assertEquals(numValues, totalSize);
+        assertEquals(numValues, totalSize - numDeletes);
+      }
+    }
+  }
+
+  /**
+   * Index random vectors, sometimes skipping documents, sometimes deleting a document, sometimes
+   * merging, sometimes sorting the index, and verify that the expected values can be read back
+   * consistently.
+   */
+  public void testRandom2() throws Exception {

Review comment:
       I forgot why I had done this (there is already a testRandom), so I added some comments explaining how it differs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] msokolov commented on a change in pull request #2239: LUCENE-9695: don't merge deleted vectors

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239#discussion_r563309172



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -153,13 +153,12 @@ public int nextDoc() throws IOException {
     private final DocIDMerger<VectorValuesSub> docIdMerger;
     private final int[] ordBase;
     private final int cost;
-    private final int size;
+    private int size;
 
     private int docId;
     private VectorValuesSub current;
-    // For each doc with a vector, record its ord in the segments being merged. This enables random
-    // access into the
-    // unmerged segments using the ords from the merged segment.
+    /* For each doc with a vector, record its ord in the segments being merged. This enables random access into the unmerged segments using the ords from the merged segment.
+     */

Review comment:
       Hmmm I thought spotless would wrap this line, but it doesn't seem to complain about it

##########
File path: lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java
##########
@@ -578,6 +572,8 @@ private int createIndex(Path docsPath, Path indexPath) throws IOException {
     IndexWriterConfig iwc = new IndexWriterConfig().setOpenMode(IndexWriterConfig.OpenMode.CREATE);
     // iwc.setMergePolicy(NoMergePolicy.INSTANCE);
     iwc.setRAMBufferSizeMB(1994d);
+    iwc.setMaxBufferedDocs(10000);

Review comment:
       Oh I did not mean to include this change here. Probably we will want to have some command line parameter to control this, but for now having the default be to index a large segment is probably better

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -194,6 +197,7 @@ public int nextDoc() throws IOException {
       current = docIdMerger.next();
       if (current == null) {
         docId = NO_MORE_DOCS;
+        size = ord;

Review comment:
       Thanks, yes I did.

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -124,21 +126,84 @@ public void testMerge() throws Exception {
     }
   }
 
-  private void dumpGraph(KnnGraphValues values, int size) throws IOException {
+  /**
+   * Verify that we get the *same* graph by indexing one segment as we do by indexing two segments
+   * and merging.
+   */
+  public void testMergeProducesSameGraph() throws Exception {

Review comment:
       thanks, yes this flushed out the two problems I saw, so I'm pretty confident they are fixed now, after running this a few 100 times. I had also wanted to add a test asserting that KNN search precision remains above some threshold, but sadly with random vectors, it would always eventually fail, even though mostly it would succeed, so not a very useful unit test and I removed it. Probably we can add something to luceneutil

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -748,29 +751,107 @@ public void testRandom() throws Exception {
             assertEquals(dimension, v.length);
             String idString = ctx.reader().document(docId).getField("id").stringValue();
             int id = Integer.parseInt(idString);
-            assertArrayEquals(idString, values[id], v, 0);
-            ++valueCount;
+            if (ctx.reader().getLiveDocs() == null || ctx.reader().getLiveDocs().get(docId)) {
+              assertArrayEquals(idString, values[id], v, 0);
+              ++valueCount;
+            } else {
+              assertNull(values[id]);
+            }
           }
         }
         assertEquals(numValues, valueCount);
-        assertEquals(numValues, totalSize);
+        assertEquals(numValues, totalSize - numDeletes);
+      }
+    }
+  }
+
+  /**
+   * Index random vectors, sometimes skipping documents, sometimes deleting a document, sometimes
+   * merging, sometimes sorting the index, and verify that the expected values can be read back
+   * consistently.
+   */
+  public void testRandom2() throws Exception {

Review comment:
       I forgot why I had done this (there is already a testRandom), so I added some comments explaining how it differs.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2239: LUCENE-9695: don't merge deleted vectors

Posted by GitBox <gi...@apache.org>.
mikemccand commented on a change in pull request #2239:
URL: https://github.com/apache/lucene-solr/pull/2239#discussion_r563217009



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -748,29 +751,107 @@ public void testRandom() throws Exception {
             assertEquals(dimension, v.length);
             String idString = ctx.reader().document(docId).getField("id").stringValue();
             int id = Integer.parseInt(idString);
-            assertArrayEquals(idString, values[id], v, 0);
-            ++valueCount;
+            if (ctx.reader().getLiveDocs() == null || ctx.reader().getLiveDocs().get(docId)) {
+              assertArrayEquals(idString, values[id], v, 0);
+              ++valueCount;
+            } else {
+              assertNull(values[id]);
+            }
           }
         }
         assertEquals(numValues, valueCount);
-        assertEquals(numValues, totalSize);
+        assertEquals(numValues, totalSize - numDeletes);
+      }
+    }
+  }
+
+  /**
+   * Index random vectors, sometimes skipping documents, sometimes deleting a document, sometimes
+   * merging, sometimes sorting the index, and verify that the expected values can be read back
+   * consistently.
+   */
+  public void testRandom2() throws Exception {

Review comment:
       Yay!

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java
##########
@@ -124,21 +126,84 @@ public void testMerge() throws Exception {
     }
   }
 
-  private void dumpGraph(KnnGraphValues values, int size) throws IOException {
+  /**
+   * Verify that we get the *same* graph by indexing one segment as we do by indexing two segments
+   * and merging.
+   */
+  public void testMergeProducesSameGraph() throws Exception {

Review comment:
       Nice test :)

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/VectorWriter.java
##########
@@ -194,6 +197,7 @@ public int nextDoc() throws IOException {
       current = docIdMerger.next();
       if (current == null) {
         docId = NO_MORE_DOCS;
+        size = ord;

Review comment:
       Maybe also add a comment here so we understand the sneakiness in context?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org