You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/09/19 23:24:17 UTC

[GitHub] [lucene] msokolov opened a new pull request, #11790: Mark HNSW search results incomplete when fewer than topK are found

msokolov opened a new pull request, #11790:
URL: https://github.com/apache/lucene/pull/11790

   This addresses a random test failure that came up recently due to another fix. I think this failure exposed a hole in our logic; when a search returns fewer results than requested *and we have not explored the entire graph*, we should fall back to exhaustive search. This can happen in degenerate cases such as this test creates.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a diff in pull request #11790: Mark HNSW search results incomplete when fewer than topK are found

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on code in PR #11790:
URL: https://github.com/apache/lucene/pull/11790#discussion_r975678640


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -267,6 +267,9 @@ private NeighborQueue searchLevel(
     while (results.size() > topK) {
       results.pop();
     }
+    if (level == 0 && results.size() < topK && numVisited < size) {

Review Comment:
   I'm feeling unsure about this check -- I'm having trouble understanding the general principle. First, it only applies when a filter is present, but it seems like we could run into the problem even without filtering (in a highly disconnected graph?)
   
   Also, could it sometimes cause surprisingly slow searches? Let's say you had a filter that matched 90% of documents. Because the graph is disconnected, you return find fewer than `topK` matches. Then you fall back to an exact scan, but this needs to consider 90% of the index!
   
   For me it'd be better to just remove this test (as we chatted about in https://github.com/apache/lucene/issues/11787), since we generally don't have proper handling for poorly connected graphs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov closed pull request #11790: Mark HNSW search results incomplete when fewer than topK are found

Posted by GitBox <gi...@apache.org>.

msokolov closed pull request #11790: Mark HNSW search results incomplete when fewer than topK are found
URL: https://github.com/apache/lucene/pull/11790


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] msokolov commented on a diff in pull request #11790: Mark HNSW search results incomplete when fewer than topK are found

Posted by GitBox <gi...@apache.org>.

msokolov commented on code in PR #11790:
URL: https://github.com/apache/lucene/pull/11790#discussion_r975881444


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -267,6 +267,9 @@ private NeighborQueue searchLevel(
     while (results.size() > topK) {
       results.pop();
     }
+    if (level == 0 && results.size() < topK && numVisited < size) {

Review Comment:
   That's fair - this seems a bit risky, certainly more so than dropping the test.
   
   I thought this would work even when there is no filter, but I may have been confused?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a diff in pull request #11790: Mark HNSW search results incomplete when fewer than topK are found

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on code in PR #11790:
URL: https://github.com/apache/lucene/pull/11790#discussion_r975678640


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -267,6 +267,9 @@ private NeighborQueue searchLevel(
     while (results.size() > topK) {
       results.pop();
     }
+    if (level == 0 && results.size() < topK && numVisited < size) {

Review Comment:
   I'm feeling unsure about this check -- I'm having trouble understanding the general principle. First, it only applies when a filter is present, but it seems like we could run into the problem even without filtering (in a highly disconnected graph?)
   
   Also, could it sometimes cause surprisingly slow searches? Let's say you had a filter that matched 90% of documents. Because the graph is disconnected, we find fewer than `topK` matches. Then we fall back to an exact scan, but this needs to consider 90% of the index!
   
   For me it'd be better to just remove this test (as we chatted about in https://github.com/apache/lucene/issues/11787), since we generally don't have proper handling for poorly connected graphs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org