You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/02/08 20:43:20 UTC

[GitHub] [lucene] jtibshirani edited a comment on pull request #656: LUCENE-10382: Support filtering in KnnVectorQuery

jtibshirani edited a comment on pull request #656:
URL: https://github.com/apache/lucene/pull/656#issuecomment-1032109021


   I tried out the around stopping the HNSW search early if it visits too many docs. To test, I modified `KnnGraphTester` to create `acceptDocs` uniformly at random with a certain selectivity, then measured recall and QPS. Here are the results on glove-100-angular (~1.2 million docs) with a filter selectivity 0.01:
   
   **Baseline**
   ```
   k        Recall    VisitedDocs     QPS  
   10        0.774       15957     232.083
   50        0.930       63429      58.994
   80        0.958       95175      42.470
   100       0.967      118891      35.203
   500       0.997     1176237       8.136
   800       0.999     1183514       5.571
   ```
   
   **PR**
   ```
   k        Recall    VisitedDocs     QPS  
   10        1.000	       22908     190.286
   50        1.000	       23607     152.224
   80        1.000	       23608     148.036
   100       1.000	       23608     145.381
   500       1.000	       23608     138.903
   800       1.000	       23608     137.882
   ```
   
   Since the filter is so selective, HNSW always visits more than 1% of the docs. The adaptive logic in the PR decides to stop the search and switch to an exact search, which bounds the visited docs at 2%. For `k=10` this makes the QPS a little worse, but overall prevents QPS from degrading (with the side benefit of perfect recall). I also tested with less restrictive filters, and in these cases the fallback just doesn't kick in, so the QPS remains the same as before.
   
   Overall I like this approach because it doesn't require us to fiddle with thresholds or expose new parameters. It could also help make HNSW more robust in "pathological" cases where even when the filter is not that selective, all the nearest vectors to a query happen to be filtered away.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org