You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/08/26 12:30:15 UTC

[GitHub] [lucene] vsop-479 opened a new pull request, #11722: Binary search the entries when all suffixes have the same length in a leaf block.

vsop-479 opened a new pull request, #11722:
URL: https://github.com/apache/lucene/pull/11722

   Binary search may have a better performance, when all suffixes have the same length in a leaf block.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #11722:
URL: https://github.com/apache/lucene/pull/11722#discussion_r982050404


##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BasePostingsFormatTestCase.java:
##########
@@ -367,6 +367,53 @@ public void testGhosts() throws Exception {
     dir.close();
   }
 
+  public void testBinarySearchTermLeaf() throws Exception {
+    Directory dir = newDirectory();
+
+    IndexWriterConfig iwc = newIndexWriterConfig(null);
+    iwc.setCodec(getCodec());
+    iwc.setMergePolicy(newTieredMergePolicy());
+    IndexWriter iw = new IndexWriter(dir, iwc);
+
+    for (int i = 100000; i <= 100400; i++){
+      // only add odd number
+      if (i % 2 == 1) {
+        Document document = new Document();
+        document.add(new StringField("id", i + "", Field.Store.NO));
+        iw.addDocument(document);
+      }
+    }
+    iw.commit();
+    iw.forceMerge(1);
+
+    DirectoryReader reader = DirectoryReader.open(iw);
+    TermsEnum termsEnum = getOnlyLeafReader(reader).terms("id").iterator();
+    // test seekExact
+    for (int i = 100000; i <= 100400; i++){
+      BytesRef target = new BytesRef(i + "");
+      if (i % 2 == 1) {
+        assertTrue(termsEnum.seekExact(target));
+        assertEquals(termsEnum.term(), target);
+      } else {
+        assertFalse(termsEnum.seekExact(target));
+      }
+    }
+    // test seekCeil
+    for (int i = 100000; i < 100400; i++){
+      BytesRef target = new BytesRef(i + "");
+      if (i % 2 == 1) {
+        assertEquals(SeekStatus.FOUND, termsEnum.seekCeil(target));
+        assertEquals(termsEnum.term(), target);
+      } else {
+        assertEquals(SeekStatus.NOT_FOUND, termsEnum.seekCeil(target));

Review Comment:
   maybe also assert here that the terms enum is positioned on the next term?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1282305919

   I had to revert this change because of test failures, e.g. this seed reproduces on the main branch:
   
   ```
   gradlew test --tests TestNumericDocValuesUpdates.testSortedIndex -Dtests.seed=4C6E977E1F29E069 -Dtests.locale=khq -Dtests.timezone=Asia/Hong_Kong -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   ```
   
   It seems like this test exercises some logic that `BasePostingsFormatTestCase` doesn't but I haven't figured out what yet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256739176

   
   > 200 fixed-size IDs and we'd make sure that the binary search works as expected for both `seekCeil` and `seekExact` for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?
   
   Got it. I think it is a good idea for a unit test.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256184937

   > I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?
   
   1M documents is too much for a unit test, I was thinking of a smaller dataset, e.g. 200 fixed-size IDs and we'd make sure that the binary search works as expected for both `seekCeil` and `seekExact` for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1255837607

   @jpountz 
   Thanks for your review.
   I did a simple performance test, which indexed 1M random UUID's substring(2, 8), got 10 segments, and picked up 1K terms to search. Average Result of 4 times tests: 
   
   Method took:
   baseline(scanToTermLeaf)ns | candidate(binarySearchTermLeaf)ns | speedup
   -- | -- |--
   5,757,121.5 | 4,761,325.5 | 20.9%
   
   Whole search took:
   baseline(scan)ns | candidate(binarySearch)ns | speedup
   -- | -- |--
   162,668,448 | 157,990,611 | 2.9%
   
   In my test case, scanToTerm only took 3.5% of the whole search execute time, so it could only got a small speedup.
   I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?
   I willl update the comment, please have a review. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1294803341

   @jpountz I have fixed the test failures. 
   But I cant find the new commit in this PR (since it has been merged and reverted?), do I need to open an new PR or some what to show the new commit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1294864473

   A new PR would be great, thank you for looking into these failures!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz merged pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz merged PR #11722:
URL: https://github.com/apache/lucene/pull/11722


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #11722:
URL: https://github.com/apache/lucene/pull/11722#discussion_r977400678


##########
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##########
@@ -646,6 +648,84 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx
     return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) throws IOException {
+    // if (DEBUG) System.out.println("    binarySearchTermLeaf: block fp=" + fp + " prefix=" + prefix + "
+    // nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
+    // brToString(term));
+
+    assert nextEnt != -1;
+
+    ste.termExists = true;
+    subCode = 0;
+
+    if (nextEnt == entCount) {
+      if (exactOnly) {
+        fillTerm();
+      }
+      return SeekStatus.END;
+    }
+
+    assert prefixMatches(target);
+
+    suffix = suffixLengthsReader.readVInt();
+    int start = nextEnt;
+    int end = entCount - 1;
+    //Binary search the entries (terms) in this leaf block:
+    while (start <= end) {
+      int mid = (start + end) / 2;
+      nextEnt = mid + 1;
+      startBytePos = mid * suffix;
+      // Loop over bytes in the suffix, comparing to the target

Review Comment:
   Maybe update the comment, it's no longer a loop?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1258003910

   @jpountz 
   I added a test to BasePostingsFormatTestCase. Please have a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on code in PR #11722:
URL: https://github.com/apache/lucene/pull/11722#discussion_r981512784


##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BasePostingsFormatTestCase.java:
##########
@@ -367,6 +367,49 @@ public void testGhosts() throws Exception {
     dir.close();
   }
 
+  public void testBinarySearchTermLeaf() throws Exception {
+    Directory dir = newDirectory();
+
+    IndexWriterConfig iwc = newIndexWriterConfig(null);
+    iwc.setCodec(getCodec());
+    iwc.setMergePolicy(newTieredMergePolicy());
+    IndexWriter iw = new IndexWriter(dir, iwc);
+
+    for (int i = 100000; i <= 100400; i++){
+      // only add odd number
+      if (i % 2 == 1) {
+        Document document = new Document();
+        document.add(new StringField("id", i + "", Field.Store.NO));
+        iw.addDocument(document);
+      }
+    }
+    iw.commit();
+    iw.forceMerge(1);
+
+    DirectoryReader reader = DirectoryReader.open(iw);
+    TermsEnum termsEnum = getOnlyLeafReader(reader).terms("id").iterator();
+    // test seekExact
+    for (int i = 100000; i <= 100400; i++){
+      if (i % 2 == 1) {
+        assertTrue(termsEnum.seekExact(new BytesRef(i + "")));

Review Comment:
   maybe assert that the value of `termsEnum.term()` is correct after this call



##########
lucene/test-framework/src/java/org/apache/lucene/tests/index/BasePostingsFormatTestCase.java:
##########
@@ -367,6 +367,49 @@ public void testGhosts() throws Exception {
     dir.close();
   }
 
+  public void testBinarySearchTermLeaf() throws Exception {
+    Directory dir = newDirectory();
+
+    IndexWriterConfig iwc = newIndexWriterConfig(null);
+    iwc.setCodec(getCodec());
+    iwc.setMergePolicy(newTieredMergePolicy());
+    IndexWriter iw = new IndexWriter(dir, iwc);
+
+    for (int i = 100000; i <= 100400; i++){
+      // only add odd number
+      if (i % 2 == 1) {
+        Document document = new Document();
+        document.add(new StringField("id", i + "", Field.Store.NO));
+        iw.addDocument(document);
+      }
+    }
+    iw.commit();
+    iw.forceMerge(1);
+
+    DirectoryReader reader = DirectoryReader.open(iw);
+    TermsEnum termsEnum = getOnlyLeafReader(reader).terms("id").iterator();
+    // test seekExact
+    for (int i = 100000; i <= 100400; i++){
+      if (i % 2 == 1) {
+        assertTrue(termsEnum.seekExact(new BytesRef(i + "")));
+      } else {
+        assertFalse(termsEnum.seekExact(new BytesRef(i + "")));
+      }
+    }
+    // test seekCeil
+    for (int i = 100000; i < 100400; i++){
+      if (i % 2 == 1) {
+        assertEquals(SeekStatus.FOUND, termsEnum.seekCeil(new BytesRef(i + "")));

Review Comment:
   maybe assert that the value of `termsEnum.term()` is correct after this call



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1260490249

   @jpountz Thranks for your review and suggestion. I have added a CHANGES entry and assert term value code. 
   Please have a review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

Posted by GitBox <gi...@apache.org>.
jpountz commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1282060420

   @mikemccand You might want to have a look at this change since (I think) you are one of the most familiar ones with the original code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org