You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/01/18 10:44:15 UTC

[GitHub] [lucene-solr] codaitya opened a new pull request #2214: LUCENE-9674:Faster advance on Vector Values

codaitya opened a new pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214


   Currently the advance() function in the class Lucene90VectorReader does a
   linear search for the target document.
   This will make retrieving vectors for a sparse set of documents efficient.
   
   <!--
   _(If you are a project committer then you may remove some/all of the following template.)_
   
   Before creating a pull request, please file an issue in the ASF Jira system for Lucene or Solr:
   
   * https://issues.apache.org/jira/projects/LUCENE
   * https://issues.apache.org/jira/projects/SOLR
   
   You will need to create an account in Jira in order to create an issue.
   
   The title of the PR should reference the Jira issue number in the form:
   
   * LUCENE-####: <short description of problem or changes>
   * SOLR-####: <short description of problem or changes>
   
   LUCENE and SOLR must be fully capitalized. A short description helps people scanning pull requests for items they can work on.
   
   Properly referencing the issue in the title ensures that Jira is correctly updated with code review comments and commits. -->
   
   
   # Description
   
   Currently the advance() function in the class Lucene90VectorReader does a
   linear search for the target document. This can be an expensive operation if we are searching for
   sparse documents having vector fields.
   
   # Solution
   
   Implement a binary search over the "ordToDoc" array which will make the advance operation
   take logarithmic time to search.
   
   # Tests
   
   Added testAdvance() in class TestVectorValues. It creates an index with gaps for vector fields and randomly calls advance.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request title.
   - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `master` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560716737



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       I think the range of values for `ord` will be `[0,fieldEntry.ordToDoc.length]`, so we dont need a `>` check ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r561092241



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       sure, will add a check.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r561092241



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       sure, will add an assert check.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560719085



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();
+          int cur = -1;
+          while (cur != NO_MORE_DOCS) {
+            cur = vectorValues.nextDoc();
+            vectorDocs.add(cur);
+          }
+          vectorValues = r.getVectorValues(fieldName);
+          for (int i = 0; i != NO_MORE_DOCS && i < numdocs; i++) {
+            // randomly advance to i
+            if (random().nextInt(4) == 3) {
+              vectorValues.advance(i);
+              assertEquals((int) vectorDocs.higher(i - 1), vectorValues.docID());
+              if (vectorValues.docID() == NO_MORE_DOCS) {
+                break;
+              }
+              // ensure i is always greater than vectorValues.docID()
+              i = vectorValues.docID();

Review comment:
       The intention was that in the next loop iteration `i` wil be incremented,  but yes this confusing. I will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] s1monw commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

s1monw commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560412822



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       should this read `ord >= fieldEntry.ordToDoc.length` to make sure we never grow beyond?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov merged pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov merged pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] s1monw commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

s1monw commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560189606



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -387,8 +387,40 @@ public int nextDoc() {
 
     @Override
     public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+      assert docID() < target;
+      int idx = binarySearch(target);

Review comment:
       I wonder if we can simply use `Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1 ,fieldEntry.ordToDoc.length - 1, target);` here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560295398



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -387,8 +387,40 @@ public int nextDoc() {
 
     @Override
     public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+      assert docID() < target;
+      int idx = binarySearch(target);

Review comment:
       Yes I like the idea of not maintaining our own binary search. Updated the PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560546645



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));

Review comment:
       Maybe use SearchStrategy.NONE here so we don't bother building a graph for this test case, which works the same with or without

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;

Review comment:
        `atLeast(1000)` will randomize the number of docs

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();

Review comment:
       could we store the values in an `int[]` here? They are guaranteed to be returned in increasing order by `nextDoc` (although asserting that is so would be good). Then below we can maintain an index into the array and advance by one to get the next expected docid.

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();
+          int cur = -1;
+          while (cur != NO_MORE_DOCS) {
+            cur = vectorValues.nextDoc();
+            vectorDocs.add(cur);
+          }
+          vectorValues = r.getVectorValues(fieldName);
+          for (int i = 0; i != NO_MORE_DOCS && i < numdocs; i++) {
+            // randomly advance to i
+            if (random().nextInt(4) == 3) {
+              vectorValues.advance(i);
+              assertEquals((int) vectorDocs.higher(i - 1), vectorValues.docID());
+              if (vectorValues.docID() == NO_MORE_DOCS) {
+                break;
+              }
+              // ensure i is always greater than vectorValues.docID()
+              i = vectorValues.docID();

Review comment:
       The comment above is confusing since i is manifestly *not* greater than vectorValues.docID() , it's equal!

##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();
+          int cur = -1;
+          while (cur != NO_MORE_DOCS) {
+            cur = vectorValues.nextDoc();
+            vectorDocs.add(cur);
+          }
+          vectorValues = r.getVectorValues(fieldName);
+          for (int i = 0; i != NO_MORE_DOCS && i < numdocs; i++) {
+            // randomly advance to i
+            if (random().nextInt(4) == 3) {
+              vectorValues.advance(i);

Review comment:
       let's assert that the return value of `advance` is what we expect it to be




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560716737



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       I think the range of values for `ord` will be between `[0,fieldEntry.ordToDoc.length]`, so we dont need a `>` check ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov edited a comment on pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov edited a comment on pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#issuecomment-764702532


   >  It works a bit better if you push a new commit for each revision rather than force-push.
   
   Oh hmm, I see you did do that! I;m not sure why I thought you had force-pushed... never mind!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560315821



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, target);

Review comment:
       I should have used `Arrays.binarySearch(fieldEntry.ordToDoc,  ord + 1,  fieldEntry.ordToDoc.length - 1,  target)`. Will update




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560718708



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));

Review comment:
       Sure




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560718485



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();
+          int cur = -1;
+          while (cur != NO_MORE_DOCS) {
+            cur = vectorValues.nextDoc();
+            vectorDocs.add(cur);
+          }
+          vectorValues = r.getVectorValues(fieldName);
+          for (int i = 0; i != NO_MORE_DOCS && i < numdocs; i++) {
+            // randomly advance to i
+            if (random().nextInt(4) == 3) {
+              vectorValues.advance(i);

Review comment:
       sure




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r561041920



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -386,9 +387,18 @@ public int nextDoc() {
     }
 
     @Override
-    public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+    public int advance(int target) {
+      assert docID() < target;
+      ord = Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1, fieldEntry.ordToDoc.length, target);
+      if (ord < 0) {
+        ord = -(ord + 1);
+      }
+      if (ord == fieldEntry.ordToDoc.length) {

Review comment:
       I guess it could save us from some mistake. We could add an `assert ord <= fieldEntry.ordToDoc.length`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov edited a comment on pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov edited a comment on pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#issuecomment-764702532


   >  It works a bit better if you push a new commit for each revision rather than force-push.
   
   Oh hmm, I see you did do that! I;m not sure why I thought you had force-pushed... never mind!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] s1monw commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

s1monw commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560189755



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -387,8 +387,40 @@ public int nextDoc() {
 
     @Override
     public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+      assert docID() < target;
+      int idx = binarySearch(target);

Review comment:
       I wonder if we can simply use `Arrays.binarySearch(fieldEntry.ordToDoc, ord + 1 ,fieldEntry.ordToDoc.length - 1, target);` here?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#issuecomment-764702532


   >  It works a bit better if you push a new commit for each revision rather than force-push.
   Oh hmm, I see you did do that! I;m not sure why I thought you had force-pushed... never mind!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560234117



##########
File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90VectorReader.java
##########
@@ -387,8 +387,40 @@ public int nextDoc() {
 
     @Override
     public int advance(int target) throws IOException {
-      // We could do better by log-binary search in ordToDoc, but this is never used
-      return slowAdvance(target);
+      assert docID() < target;
+      int idx = binarySearch(target);

Review comment:
       I think we could, we would pay the very small cost of decoding the negative values it returns (and it must encode them), when we don't care about the distinction of present/not-present, but probably that's not significant and we avoid the cost of maintaining our own binary search here. +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov commented on pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov commented on pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#issuecomment-764702532


   >  It works a bit better if you push a new commit for each revision rather than force-push.
   Oh hmm, I see you did do that! I;m not sure why I thought you had force-pushed... never mind!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] msokolov merged pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

msokolov merged pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560718642



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;

Review comment:
       Will use it in next revision




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] codaitya commented on a change in pull request #2214: LUCENE-9674:Faster advance on Vector Values

Posted by GitBox <gi...@apache.org>.

codaitya commented on a change in pull request #2214:
URL: https://github.com/apache/lucene-solr/pull/2214#discussion_r560718191



##########
File path: lucene/core/src/test/org/apache/lucene/index/TestVectorValues.java
##########
@@ -815,4 +816,47 @@ public void testSearchStrategyIdentifiers() {
     assertEquals(2, VectorValues.SearchStrategy.DOT_PRODUCT_HNSW.ordinal());
     assertEquals(3, VectorValues.SearchStrategy.values().length);
   }
+
+  public void testAdvance() throws Exception {
+    try (Directory dir = newDirectory()) {
+      try (IndexWriter w = new IndexWriter(dir, createIndexWriterConfig())) {
+        int numdocs = 2000;
+        String fieldName = "field";
+        for (int i = 0; i < numdocs; i++) {
+          Document doc = new Document();
+          // randomly add a vector field
+          if (random().nextInt(4) == 3) {
+            doc.add(
+                new VectorField(
+                    fieldName, new float[4], VectorValues.SearchStrategy.DOT_PRODUCT_HNSW));
+          }
+          w.addDocument(doc);
+        }
+        w.forceMerge(1);
+        try (IndexReader reader = w.getReader()) {
+          LeafReader r = getOnlyLeafReader(reader);
+          VectorValues vectorValues = r.getVectorValues(fieldName);
+          TreeSet<Integer> vectorDocs = new TreeSet<>();

Review comment:
       We might need to advance the index by more than 1 sometimes ? But I can also do that with an array since it's always increasing. I will update to use array instead of TreeSet.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org