You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/08/25 17:50:16 UTC

[GitHub] [lucene] msokolov opened a new pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

msokolov opened a new pull request #262:
URL: https://github.com/apache/lucene/pull/262


   simple linear scan brute force vector search impl


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696153258



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);

Review comment:
       If the field exists but vectors are not enabled, then I think we'll get `VectorValues.EMPTY` here, and either trip on `target.length != values.dimension()` or return an empty `TopDocs`? It could cleaner to catch this early and return `null` as we do for `Lucene90HnswVectorsReader`. Maybe this would involve changing `getVectorValues` to return `null` if vectors are not enabled.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-909460252


   Thanks for the reminder! I had lost track...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696839771



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);

Review comment:
       Thanks, that's good to know. I guess we could simplify the handling in some other places then too.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov merged pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
msokolov merged pull request #262:
URL: https://github.com/apache/lucene/pull/262


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696363685



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);
+    if (values == null) {
+      return null;
+    }
+    if (target.length != values.dimension()) {
+      throw new IllegalArgumentException(
+          "incorrect dimension for field "
+              + field
+              + "; expected "
+              + values.dimension()
+              + " but target has "
+              + target.length);
+    }
+    FieldInfo info = readState.fieldInfos.fieldInfo(field);
+    VectorSimilarityFunction vectorSimilarity = info.getVectorSimilarityFunction();
+    HitQueue topK = new HitQueue(k, false);
+    int doc;
+    while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      float[] vector = values.vectorValue();
+      float score = vectorSimilarity.compare(vector, target);
+      if (vectorSimilarity.reversed) {
+        score = 1 / (score + 1);
+      }
+      topK.insertWithOverflow(new ScoreDoc(doc, score));
+    }
+    ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+    int i = 0;
+    for (ScoreDoc scoreDoc : topK) {
+      topScoreDocs[i++] = scoreDoc;
+    }
+    Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
       I would have expected this method to return vectors by descending score. This makes me curious about what the exact contract of this method is regarding the order of the hits. If it is unspecified, maybe we should make it clearer in javadocs?

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);

Review comment:
       Our general approach for these problems is that the `XXXReader`/`XXXProducer` classes can assume that their methods are only called on fields that have the feature enabled according to `FieldInfos`, and it's the responsibility of `CodecReader` to check `FieldInfos` before forwarding calls to `XXXReader`/`XXXProducer` classes. This seems to already be done correctly in `CodecReader#searchNearestVectors`.
   
   So I think we are good, and we should even remove the `if (values == null)` check below, which is not necessary and might hide bugs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696747416



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +145,36 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);
+    if (values == null) {
+      return null;
+    }

Review comment:
       Let's remove this null check? It shouldn't be necessary since `getVectorValues()` is expected to return a non-null value when the number of dimentions is greater than 0 on the `FieldInfo`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-908331029


   @msokolov Let's merge this PR to stop test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696706087



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);

Review comment:
       Thanks, let's follow the convention of relying on callers to do such checking then.

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);
+    if (values == null) {
+      return null;
+    }
+    if (target.length != values.dimension()) {
+      throw new IllegalArgumentException(
+          "incorrect dimension for field "
+              + field
+              + "; expected "
+              + values.dimension()
+              + " but target has "
+              + target.length);
+    }
+    FieldInfo info = readState.fieldInfos.fieldInfo(field);
+    VectorSimilarityFunction vectorSimilarity = info.getVectorSimilarityFunction();
+    HitQueue topK = new HitQueue(k, false);
+    int doc;
+    while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      float[] vector = values.vectorValue();
+      float score = vectorSimilarity.compare(vector, target);
+      if (vectorSimilarity.reversed) {
+        score = 1 / (score + 1);
+      }
+      topK.insertWithOverflow(new ScoreDoc(doc, score));
+    }
+    ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+    int i = 0;
+    for (ScoreDoc scoreDoc : topK) {
+      topScoreDocs[i++] = scoreDoc;
+    }
+    Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
       No, that's exactly right - should be sorted by score here, not by docid - I was confused having just written the Query implementation. I do see the `KnnVectorsReader` javadoc doesn't explicitly state what the contract is supposed to be; let's rectify that in a separate issue.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jpountz commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-908331029


   @msokolov Let's merge this PR to stop test failures?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov merged pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
msokolov merged pull request #262:
URL: https://github.com/apache/lucene/pull/262


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #262:
URL: https://github.com/apache/lucene/pull/262#discussion_r696842465



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
##########
@@ -140,7 +147,38 @@ public VectorValues getVectorValues(String field) throws IOException {
 
   @Override
   public TopDocs search(String field, float[] target, int k, Bits acceptDocs) throws IOException {
-    throw new UnsupportedOperationException();
+    VectorValues values = getVectorValues(field);
+    if (values == null) {
+      return null;
+    }
+    if (target.length != values.dimension()) {
+      throw new IllegalArgumentException(
+          "incorrect dimension for field "
+              + field
+              + "; expected "
+              + values.dimension()
+              + " but target has "
+              + target.length);
+    }
+    FieldInfo info = readState.fieldInfos.fieldInfo(field);
+    VectorSimilarityFunction vectorSimilarity = info.getVectorSimilarityFunction();
+    HitQueue topK = new HitQueue(k, false);
+    int doc;
+    while ((doc = values.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
+      float[] vector = values.vectorValue();
+      float score = vectorSimilarity.compare(vector, target);
+      if (vectorSimilarity.reversed) {
+        score = 1 / (score + 1);
+      }
+      topK.insertWithOverflow(new ScoreDoc(doc, score));
+    }
+    ScoreDoc[] topScoreDocs = new ScoreDoc[topK.size()];
+    int i = 0;
+    for (ScoreDoc scoreDoc : topK) {
+      topScoreDocs[i++] = scoreDoc;
+    }
+    Arrays.sort(topScoreDocs, Comparator.comparingInt(x -> x.doc));

Review comment:
       I was apparently confused too :) It'd also be great to document what the TotalHits part of the result means, this is a bit subtle. I think it's the number of docs that were visited during the search.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on pull request #262: LUCENE-10063: implement SimpleTextKnnvectorsReader.search

Posted by GitBox <gi...@apache.org>.
msokolov commented on pull request #262:
URL: https://github.com/apache/lucene/pull/262#issuecomment-909460252


   Thanks for the reminder! I had lost track...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org