You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/10/25 19:26:37 UTC

[GitHub] [lucene] jtibshirani opened a new pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

jtibshirani opened a new pull request #413:
URL: https://github.com/apache/lucene/pull/413


   When the reader has no live docs, `KnnVectorQuery` can error out. This happens
   because `IndexReader#numDocs` is 0, and we end up passing an illegal value of
   `k = 0` to the search method.
   
   Now we make sure `KnnVectorQuery` always passes `k > 0` to the search method.
   The commit also updates the numDocs check to use the leaf reader (instead of
   top-level reader), which seems more helpful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737599366



##########
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##########
@@ -25,18 +25,10 @@
 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;
-import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
-import org.apache.lucene.document.KnnVectorField;
-import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.document.*;
+import org.apache.lucene.index.*;

Review comment:
       I also noticed our static analysis is totally fine with it (surprisingly?) I'll need to fix my IntelliJ setup :)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737016446



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       This makes sense to me, I pushed a change. Instead of `Lucene90HnswVectorsReader`, I thought it could make sense to apply the bound in `HnswGraph`. But this turned out messier because there's separate concepts for `topK` and `numSeed` (we're cleaning this up as part of [LUCENE-10054](https://issues.apache.org/jira/browse/LUCENE-10054)).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737165184



##########
File path: lucene/core/src/test/org/apache/lucene/search/TestKnnVectorQuery.java
##########
@@ -25,18 +25,10 @@
 import java.io.IOException;
 import java.util.HashSet;
 import java.util.Set;
-import org.apache.lucene.document.Document;
-import org.apache.lucene.document.Field;
-import org.apache.lucene.document.KnnVectorField;
-import org.apache.lucene.document.StringField;
-import org.apache.lucene.index.DirectoryReader;
-import org.apache.lucene.index.IndexReader;
-import org.apache.lucene.index.IndexWriter;
-import org.apache.lucene.index.IndexWriterConfig;
-import org.apache.lucene.index.RandomIndexWriter;
-import org.apache.lucene.index.Term;
-import org.apache.lucene.index.VectorSimilarityFunction;
+import org.apache.lucene.document.*;
+import org.apache.lucene.index.*;

Review comment:
       Oh, I thought we failed the build on wildcard imports, but apparently we don't. Maybe still use explicit imports to reduce line changes of this PR?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani merged pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jtibshirani merged pull request #413:
URL: https://github.com/apache/lucene/pull/413


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r736218409



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       I was thinking of passing `k` here, and moving the logic to avoid oversizing the heap to Lucene90HnswVectorsReader by doing `k = min(k, size())` (where `size()` is the number of docs that have a vector).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on pull request #413:
URL: https://github.com/apache/lucene/pull/413#issuecomment-951238852


   This is a pretty obscure bug, I only noticed it because we use a custom `FilterLeafReader` as part of our security implementation. I wasn't able to reproduce it using regular deletes or even soft deletes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jpountz commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r735928316



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       This makes me wonder why we pass `min(k, numDocs)` rather than just `k`. Is it to avoid oversizing the heap that collect nearest neighbors?

##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       One reason why I'm wondering this is because in the past we tried to avoid calling `numDocs()`unless it was strictly necessary in the past because it's expensive on reader views that hide subsets of documents ([LUCENE-9003](https://issues.apache.org/jira/browse/LUCENE-9003)).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r735961948



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       I also guessed it was to avoid oversizing the heap for tiny segments. I'm not sure how helpful this is though, maybe we can simplify and just use `k` here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a change in pull request #413: LUCENE-9614: Fix KnnVectorQuery failure when numDocs is 0

Posted by GitBox <gi...@apache.org>.
msokolov commented on a change in pull request #413:
URL: https://github.com/apache/lucene/pull/413#discussion_r737073221



##########
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##########
@@ -60,7 +60,8 @@ public KnnVectorQuery(String field, float[] target, int k) {
   public Query rewrite(IndexReader reader) throws IOException {
     TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
     for (LeafReaderContext ctx : reader.leaves()) {
-      perLeafResults[ctx.ord] = searchLeaf(ctx, Math.min(k, reader.numDocs()));
+      int numDocs = ctx.reader().numDocs();
+      perLeafResults[ctx.ord] = numDocs > 0 ? searchLeaf(ctx, Math.min(k, numDocs)) : NO_RESULTS;

Review comment:
       Thanks for fixing this - it makes sense to me use `size()` instead of `numDocs()`, or even simply `k`; I wasn't aware of the costly nature of that call. Indeed the idea here was just to avoid spending extra work on tiny segments; something I noticed all the time in tests, but which is probably not much of an issue in reality.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org