You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/05/14 14:01:29 UTC

[GitHub] [lucene] dnhatn opened a new pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

dnhatn opened a new pull request #140:
URL: https://github.com/apache/lucene/pull/140


   This change enables bulk-merge for term vectors with index sort. The algorithm used here is similar to the one that is used to merge stored fields.
   
   Relates #134


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r648815760



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,92 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 10000)));
+          }

Review comment:
       ++, I reduced the range in https://github.com/apache/lucene/pull/140/commits/af3dc8c66e89d2960e1852152de4418a96b96f7b.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on pull request #140:
URL: https://github.com/apache/lucene/pull/140#issuecomment-858701554


   @jpountz Thank you :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r649215227



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,94 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort, boolean allowDeletes) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 1024)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (allowDeletes && random().nextInt(100) < 5) {

Review comment:
       If we delete every iteration, then we will delete every newly added document and we won't have any docs at the end. I creased the probability of this from 5% to 20% (see https://github.com/apache/lucene/pull/140/commits/65a1ecb38786601e73ff1dd78b6fdb22b0f1ff48).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on pull request #140:
URL: https://github.com/apache/lucene/pull/140#issuecomment-858243141


   @jpountz Thanks for reviews. It's ready again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r649215227



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,94 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort, boolean allowDeletes) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 1024)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (allowDeletes && random().nextInt(100) < 5) {

Review comment:
       If we delete every iteration, then we will delete every newly added document and we won't have any docs at the end. I increased the probability of this from 5% to 20% (see https://github.com/apache/lucene/pull/140/commits/65a1ecb38786601e73ff1dd78b6fdb22b0f1ff48).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r649248079



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,94 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort, boolean allowDeletes) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 1024)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (allowDeletes && random().nextInt(100) < 5) {

Review comment:
       Thank you I had misunderstood how the test was working.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r648908154



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,94 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort, boolean allowDeletes) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 1024)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (allowDeletes && random().nextInt(100) < 5) {

Review comment:
       why are we randomly _not_ deleting docs when `allowDeletes` is true?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jpountz commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r647654997



##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,92 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 10000)));
+          }

Review comment:
       it's unlikely merging will find long sequences of doc IDs to merge with such a large range?

##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingTermVectorsReader.java
##########
@@ -91,6 +91,7 @@
   private final long numDirtyChunks; // number of incomplete compressed blocks written
   private final long numDirtyDocs; // cumulative number of docs in incomplete chunks
   private final long maxPointer; // end of the data section
+  private BlockState blockState = new BlockState(-1, -1, -1);

Review comment:
       ```suggestion
     private BlockState blockState = new BlockState(-1, -1, 0);
   ```
   
   nit: I'd rather use 0 as a default value of the number of docs in the chunk?

##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,92 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 10000)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (random().nextInt(100) < 5) {
+          final String deleteId = liveDocIDs.remove(random().nextInt(liveDocIDs.size()));
+          writer.deleteDocuments(new Term("id", deleteId));
+        }

Review comment:
       it's a pre-existing issue, but it would be nice to test merging with deletes and merging without deletes separately




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn merged pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn merged pull request #140:
URL: https://github.com/apache/lucene/pull/140


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dnhatn commented on a change in pull request #140: LUCENE-9935: Enable bulk-merge for term vectors with index sort

Posted by GitBox <gi...@apache.org>.

dnhatn commented on a change in pull request #140:
URL: https://github.com/apache/lucene/pull/140#discussion_r648815647



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/compressing/Lucene90CompressingTermVectorsReader.java
##########
@@ -91,6 +91,7 @@
   private final long numDirtyChunks; // number of incomplete compressed blocks written
   private final long numDirtyDocs; // cumulative number of docs in incomplete chunks
   private final long maxPointer; // end of the data section
+  private BlockState blockState = new BlockState(-1, -1, -1);

Review comment:
       +1, fixed in https://github.com/apache/lucene/pull/140/commits/7e088c723c186bf863d5afe367179cd1a6d97949

##########
File path: lucene/test-framework/src/java/org/apache/lucene/index/BaseTermVectorsFormatTestCase.java
##########
@@ -667,43 +673,92 @@ public void testRandom() throws IOException {
     dir.close();
   }
 
-  public void testMerge() throws IOException {
+  private void doTestMerge(Sort indexSort) throws IOException {
     final RandomDocumentFactory docFactory = new RandomDocumentFactory(5, 20);
     final int numDocs = atLeast(100);
-    final int numDeletes = random().nextInt(numDocs);
-    final Set<Integer> deletes = new HashSet<>();
-    while (deletes.size() < numDeletes) {
-      deletes.add(random().nextInt(numDocs));
-    }
     for (Options options : validOptions()) {
-      final RandomDocument[] docs = new RandomDocument[numDocs];
+      Map<String, RandomDocument> docs = new HashMap<>();
       for (int i = 0; i < numDocs; ++i) {
-        docs[i] = docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options);
+        docs.put(
+            Integer.toString(i),
+            docFactory.newDocument(TestUtil.nextInt(random(), 1, 3), atLeast(10), options));
       }
       final Directory dir = newDirectory();
-      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir);
-      for (int i = 0; i < numDocs; ++i) {
-        writer.addDocument(addId(docs[i].toDocument(), "" + i));
+      final IndexWriterConfig iwc = newIndexWriterConfig();
+      if (indexSort != null) {
+        iwc.setIndexSort(indexSort);
+      }
+      final RandomIndexWriter writer = new RandomIndexWriter(random(), dir, iwc);
+      List<String> liveDocIDs = new ArrayList<>();
+      List<String> ids = new ArrayList<>(docs.keySet());
+      Collections.shuffle(ids, random());
+      Runnable verifyTermVectors =
+          () -> {
+            try (DirectoryReader reader = maybeWrapWithMergingReader(writer.getReader())) {
+              for (String id : liveDocIDs) {
+                final int docID = docID(reader, id);
+                assertEquals(docs.get(id), reader.getTermVectors(docID));
+              }
+            } catch (IOException e) {
+              throw new UncheckedIOException(e);
+            }
+          };
+      for (String id : ids) {
+        final Document doc = addId(docs.get(id).toDocument(), id);
+        if (indexSort != null) {
+          for (SortField sortField : indexSort.getSort()) {
+            doc.add(
+                new NumericDocValuesField(
+                    sortField.getField(), TestUtil.nextInt(random(), 0, 10000)));
+          }
+        }
+        if (random().nextInt(100) < 5) {
+          // add via foreign writer
+          IndexWriterConfig otherIwc = newIndexWriterConfig();
+          if (indexSort != null) {
+            otherIwc.setIndexSort(indexSort);
+          }
+          try (Directory otherDir = newDirectory();
+              RandomIndexWriter otherIw = new RandomIndexWriter(random(), otherDir, otherIwc)) {
+            otherIw.addDocument(doc);
+            try (DirectoryReader otherReader = otherIw.getReader()) {
+              TestUtil.addIndexesSlowly(writer.w, otherReader);
+            }
+          }
+        } else {
+          writer.addDocument(doc);
+        }
+        liveDocIDs.add(id);
+        if (random().nextInt(100) < 5) {
+          final String deleteId = liveDocIDs.remove(random().nextInt(liveDocIDs.size()));
+          writer.deleteDocuments(new Term("id", deleteId));
+        }

Review comment:
       I pushed https://github.com/apache/lucene/pull/140/commits/40a079caf968d135d202ae74069a8dd671683bde




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org