Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/11/17 10:20:13 UTC

[GitHub] [lucene] agorlenko opened a new pull request, #11946: add similarity threshold for hnsw

agorlenko opened a new pull request, #11946:
URL: https://github.com/apache/lucene/pull/11946

   ### Description
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

Posted by Michael Sokolov <ms...@gmail.com>.

What I have in mind would be to implement entirely in the
KnnVectorQuery. Since results are sorted by score, they can easily be
post-filtered there: no need to implement anything at the codec layer
I think.

On Thu, Nov 17, 2022 at 10:10 AM GitBox <gi...@apache.org> wrote:
>
>
> rmuir commented on PR #11946:
> URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318777402
>
>    i'm also concerned about committing to providing this API for the future. eventually, we'll move away from HNSW to something that actually scales, and it may not support this thresholding?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1340151458

   I've done some experiments with real data, and it seems that this really doesn't work as I expected. If the number of docs that exceed the threshold is significant (for example, 20% or more of the previously accepted docs), the query is slow and it is better to perform an exact search. Unfortunately, this happens quite often.
   
   So I agree with @msokolov, and I think I should rewrite this PR with the post-filtering approach. It lets us preserve predictable performance without modifying LeafReader/IndexReader (we just filter TopDocs in KnnVectorQuery).



[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1367395855

   Ok, it seems that I should close this PR, shouldn't I? It is not difficult to implement such functionality in code that uses Lucene if necessary (in contrast to the first implementation).
   
   @msokolov what do you think?
   
   In any case, I thank you all for the discussion.



[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320438152

   OK, can we start by providing a post-filter? I think this will be a more
   common use case: I want to find the best docs, and ensure that none of them
   are terrible. It is less disruptive and doesn't require changes to the codec.
   Can you explain why you want the "find all docs with score > T"? That is
   going to be a scary thing. What if someone asks for T==0? Then the
   computation and memory requirements are unbounded. I don't think this is a
   search use case - it's some kind of analytics thing that you should do in
   Spark or some kind of off-line computation system.

   On Fri, Nov 18, 2022 at 2:01 PM Alexey Gorlenko ***@***.***>
   wrote:

   > But we don't know K - that's the problem. The task which I want to solve
   > sounds like this: find documents with similarity >= 0.76 (for example). We
   > don't have the number of such documents in advance.


[GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1339474637

   > What I have in mind would be to implement entirely in the
   > KnnVectorQuery. Since results are sorted by score, they can easily be
   > post-filtered there: no need to implement anything at the codec layer
   > I think. Am I missing something?
   
   is there any possibility other than adding all these LeafReader/IndexReader signatures?
   
   Currently I'm -1 to the change from an API perspective. It is too invasive.



[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320221923

   If we use only a post-filter in KnnVectorQuery, then we have to set k = Integer.MAX_VALUE (or another very big value) and calculate the similarity with all vectors. So the complexity would be O(n).
   
   I had another idea: we can check the similarity while we are traversing the graph. If the similarity is less than the threshold, we can discard this node and stop exploring this path. In that case we set k = Integer.MAX_VALUE and set the similarityThreshold value, but the time complexity would be between O(log(n)) and O(n) (it depends on the number of vectors with similarity greater than the threshold). I hope that this would allow us to solve tasks like the ones I described above (https://github.com/apache/lucene/pull/11946#issuecomment-1318924833) more efficiently.
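
The pruning idea in this comment can be sketched on a toy single-layer graph. This is an illustration only, not the code from the PR and not Lucene's HnswGraphSearcher: the class name, the method `collectAboveThreshold`, and the precomputed `similarity` array (similarity of each node to the query) are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class ThresholdTraversalSketch {
  /**
   * Walks the graph from entryPoint and keeps every visited node whose similarity to the query
   * is at least threshold. Neighbors of a node below the threshold are not explored.
   */
  static List<Integer> collectAboveThreshold(
      int[][] neighbors, float[] similarity, int entryPoint, float threshold) {
    List<Integer> accepted = new ArrayList<>();
    boolean[] visited = new boolean[neighbors.length];
    Deque<Integer> frontier = new ArrayDeque<>();
    frontier.add(entryPoint);
    visited[entryPoint] = true;
    while (!frontier.isEmpty()) {
      int node = frontier.poll();
      if (similarity[node] < threshold) {
        continue; // prune: do not expand paths through this node
      }
      accepted.add(node);
      for (int next : neighbors[node]) {
        if (!visited[next]) {
          visited[next] = true;
          frontier.add(next);
        }
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    // Toy graph: 0-1, 0-2, 1-3, 3-4 (hypothetical data).
    int[][] neighbors = {{1, 2}, {0, 3}, {0}, {1, 4}, {3}};
    float[] similarity = {0.9f, 0.8f, 0.5f, 0.7f, 0.95f};
    // Prints [0, 1]; node 4 is never reached because its only neighbor was pruned.
    System.out.println(collectAboveThreshold(neighbors, similarity, 0, 0.76f));
  }
}
```

In this toy data, node 4 has the highest similarity but is reachable only through node 3, which falls below the threshold. Pruning therefore trades recall for speed, and the amount of work still grows with the number of nodes above the threshold.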



[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1051525868


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   still don't have any explanation here as to why we'd do this for vector search query. we avoided any such thresholds or normalization in any of lucene's scoring for decades: if we didn't do that, we would have never been able to implement block-max WAND or other algorithms because they'd be incompatible.
   
   please see:
   * https://cwiki.apache.org/confluence/display/LUCENE/LuceneFAQ#LuceneFAQ-CanIfilterbyscore?
   * https://cwiki.apache.org/confluence/display/LUCENE/ScoresAsPercentages
   
   I don't mind being the bad guy blocking this change because it seems like it has not been thought thru.
   
   You must convince me.




[GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1357737816

   > * I agree that this kind of thing is valuable in KNN. KNN is unique when compared to sparse retrieval as you always retrieve K results (unless using a restrictive filter). In some cases, the K retrieved can be irrelevant, especially in the case when a filter is used. That said, it seems better fit outside of Lucene.
   
   this isn't any different than BM25 search.
   
   Nothing special about KNN. 
   
   Still no justification to filter by score / scores as percentages.



[GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318754869

   how common is this use-case? This change is fairly invasive... adding method signatures to e.g. LeafReader. 



[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320401647

   > If we use only post-filter in KnnVectorQuery, then we have to set k = Integer.MAX_VALUE (or another very big value) and calculate similarity with all vectors. So the complexity would be O(n).
   
   No, we don't have to do that. We can simply post-filter. Think of it like this - we want K matches with score > T. So we get the K top-scoring matches. If any have score less than T, we drop them. It's the same result as if we did the thresholding while collecting.
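
A minimal sketch of the post-filtering described here, written in plain Java rather than against Lucene's API. The `Hit` class and the `postFilter` method are hypothetical names; the input list stands in for the K top-scoring matches that a KNN search has already returned, sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PostFilterSketch {
  /** Hypothetical stand-in for one search hit: a doc id plus its score. */
  static final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
      this.doc = doc;
      this.score = score;
    }
  }

  /**
   * Keeps the leading hits whose score is at least threshold. Assumes topK is sorted by
   * descending score, so the scan can stop at the first hit below the threshold.
   */
  static List<Hit> postFilter(List<Hit> topK, float threshold) {
    List<Hit> kept = new ArrayList<>();
    for (Hit hit : topK) {
      if (hit.score < threshold) {
        break; // every later hit scores no higher than this one
      }
      kept.add(hit);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Hit> topK = Arrays.asList(new Hit(7, 0.93f), new Hit(2, 0.81f), new Hit(5, 0.40f));
    // Two hits survive a threshold of 0.76; the hit for doc 5 is dropped.
    System.out.println(postFilter(topK, 0.76f).size());
  }
}
```

The cost of the filter is bounded by K, independent of how many documents in the index exceed the threshold.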



[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049804451


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   OK, with the current CR, orthogonal vectors will have a DOT_PRODUCT "score" of 0.5, which could be surprising. However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined; the only guarantee is that higher scores are "more relevant". I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is, and given that I'd like to just be able to work with some kind of abstracted score in a known range (0 = worst, 1 = best).
   
   Conversely, if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them (today we don't actually explain this anywhere, I guess we'd need to document) and maybe provide methods for computing them. Also they would be weird too, just in a different way. For example, how would we explain 8-bit dot-product? Would it be the 8-bit dot-product score normalized by 2^15?




[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048856364


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   @msokolov you haven't missed anything. I am specifically talking about users providing `similarityThreshold` to the query. If they have calculated that they want a specific `cosine` or `dotProduct` similarity, they would then need to adjust that to match Lucene's scoring transformation.
   
   I think that `similarityThreshold` should mean vector similarities. We can transform it for the user to reflect the score that similarity represents (given vector encoding type and similarity function).
   
   
   An example here is `dotProduct`. The user knows they want `FLOAT32` vectors within a dotProduct of 0.7. With this API that ACTUALLY means they want to limit the scores to .85 (`(1 + dotProduct)/2`). How is the user supposed to know that?
   
   This seems really weird to me.
   
   This doesn't take into account the different scoring methods between vector types as well, which can get even more confusing.
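
The mismatch in the `dotProduct` example can be made concrete with a small sketch. Only the formula `(1 + dotProduct)/2` comes from the comment; the class and method names are illustrative assumptions.

```java
final class DotProductScoreSketch {
  /** Score transformation quoted in the comment: (1 + dotProduct) / 2. */
  static float toScore(float dotProduct) {
    return (1 + dotProduct) / 2;
  }

  public static void main(String[] args) {
    float wantedDotProduct = 0.7f;

    // The value the user would have to pass so that the cut-off lands at a dot product of 0.7.
    System.out.println(toScore(wantedDotProduct)); // approximately 0.85

    // A document with a raw dot product of 0.5 has a score of 0.75. If the user passes 0.7
    // directly as the threshold, that document is accepted although 0.5 < 0.7.
    float weakMatchScore = toScore(0.5f);
    System.out.println(weakMatchScore >= wantedDotProduct); // true
  }
}
```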




[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1325269770

   @msokolov looking forward to your decision



[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320416549

   But we don't know K - that's the problem. The task which I want to solve sounds like this: find documents with similarity >= 0.76 (for example). We don't have the number of such documents in advance.
   



[GitHub] [lucene] agorlenko commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1025027735


##########
lucene/core/src/test/org/apache/lucene/index/TestExitableDirectoryReader.java:
##########
@@ -494,13 +498,12 @@ public void testVectorValues() throws IOException {
           ExitingReaderException.class,
           () ->
               leaf.searchNearestVectors(
-                  "vector", new float[dimension], 5, leaf.getLiveDocs(), Integer.MAX_VALUE));
+                  "vector", target, 5, leaf.getLiveDocs(), Integer.MAX_VALUE));

Review Comment:
   There is a problem here, because cosine similarity is not defined for zero vectors. As a result, we get a NaN score. I thought it would be better not to handle that special case and simply drop docs whose score is NaN. We had the same behavior earlier, except in the case of the start point: https://github.com/apache/lucene/blob/a18b62ded49f1b091de7029716d6f63c06a36fc0/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java#L223-L225 We check acceptOrds.get here and raise an exception. But actually we don't need to check acceptOrds in the case of a zero vector and cosine similarity. So I think it would be better just not to consider that case when we want to test ExitingReaderException, and simply define the target as a random float vector.
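
The NaN behavior mentioned here follows from floating-point arithmetic and can be reproduced without Lucene. The `cosine` helper below is a textbook implementation written for this illustration, not Lucene's own code.

```java
final class ZeroVectorCosineSketch {
  /** Textbook cosine similarity: dot(a, b) / (|a| * |b|). */
  static float cosine(float[] a, float[] b) {
    float dot = 0;
    float normA = 0;
    float normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (float) (dot / Math.sqrt((double) normA * normB));
  }

  public static void main(String[] args) {
    float score = cosine(new float[3], new float[] {1, 2, 3});
    System.out.println(Float.isNaN(score)); // true: the zero vector gives 0 / 0
    System.out.println(score >= 0.5f); // false: NaN fails every ordered comparison
    System.out.println(score < 0.5f); // false as well
  }
}
```

Because NaN is neither above nor below any threshold, a check written as `score >= threshold` silently drops such documents, while a check written as `score < threshold` silently keeps them.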




[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1041043232


##########
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene90/Lucene90HnswVectorsReader.java:
##########
@@ -236,7 +236,13 @@ public VectorValues getVectorValues(String field) throws IOException {
   }
 
   @Override
-  public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int visitedLimit)
+  public TopDocs search(
+      String field,
+      float[] target,
+      int k,
+      float similarityThreshold,
+      Bits acceptDocs,
+      int visitedLimit)

Review Comment:
   please overload the method, and tag all the APIs experimental. I'm really concerned about us locking ourselves into HNSW, and we must...must get away from it (it's like 1000x slower than it should be).
   
   the alternative is to feature-freeze vectors completely until they scale. so i think this is a reasonable compromise.
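
One way to read the overload request, sketched with a simplified, hypothetical interface. `VectorSearcher` and `overFixedScores` are illustrative names and do not match the real reader signatures. The existing method stays in place and delegates to the new one with `Float.NEGATIVE_INFINITY`, the same default the PR's `KnnVectorQuery` constructor uses.

```java
final class OverloadSketch {
  /** Hypothetical, simplified stand-in for a vector reader; not Lucene's actual API. */
  interface VectorSearcher {
    /**
     * New variant that also takes a similarity threshold.
     *
     * @lucene.experimental
     */
    int[] search(float[] target, int k, float similarityThreshold);

    /** Existing signature, kept so current callers still compile; applies no threshold. */
    default int[] search(float[] target, int k) {
      return search(target, k, Float.NEGATIVE_INFINITY);
    }
  }

  /** Toy searcher over precomputed scores, sorted descending; doc ids are array positions. */
  static VectorSearcher overFixedScores(float[] scoresDescending) {
    return (target, k, threshold) -> {
      int n = 0;
      while (n < k && n < scoresDescending.length && scoresDescending[n] >= threshold) {
        n++;
      }
      int[] docs = new int[n];
      for (int i = 0; i < n; i++) {
        docs[i] = i;
      }
      return docs;
    };
  }

  public static void main(String[] args) {
    VectorSearcher searcher = overFixedScores(new float[] {0.9f, 0.8f, 0.4f});
    System.out.println(searcher.search(new float[0], 5).length); // 3: no threshold applied
    System.out.println(searcher.search(new float[0], 5, 0.76f).length); // 2
  }
}
```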




Re: [PR] add similarity threshold for hnsw [lucene]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1880904151

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048764283


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   As it is, callers of this method need to know the inner nuances of how we calculate the score given the similarity. I would prefer them not having to know that.




[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048846385


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Well, the scores we are talking about here are at least always in [0, 1]. I'm not sure what you mean by the actual similarity of vectors. We used to have a two-step process where we would compute the similarity and then convert to a query score, but I think it's unified today and they are the same? Aren't the scores being thresholded here the output of VectorSimilarityFunction.compare? I may have missed something along the way?




[GitHub] [lucene] msokolov commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1050171394


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   > I don't know what CR means. Change request?
   
   sorry, yes like a PR but from a parallel universe (code review actually)
   
   So .. theoretical considerations aside, what's the alternative here -- we would treat the threshold as a "vector similarity" and internally convert it to a score. I mean that seems to make sense -- all the conversions are invertible, right? I think we'd want to add a normalize method to VectorSimilarity for this internal use.
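
A sketch of what such an internal conversion might look like. The enum and its `toScore` / `toRaw` methods are hypothetical names, not an existing Lucene API. The dot-product formula is the one quoted earlier in this thread; the Euclidean formula `1 / (1 + squaredDistance)` is stated here as an assumption about the scoring in use at the time.

```java
final class ThresholdNormalizationSketch {
  /** Hypothetical per-similarity conversions between a raw similarity and a query score. */
  enum Similarity {
    DOT_PRODUCT {
      @Override
      float toScore(float raw) {
        return (1 + raw) / 2;
      }

      @Override
      float toRaw(float score) {
        return 2 * score - 1;
      }
    },
    /** Raw value is the squared distance, so a smaller raw value means a higher score. */
    EUCLIDEAN {
      @Override
      float toScore(float raw) {
        return 1 / (1 + raw);
      }

      @Override
      float toRaw(float score) {
        return 1 / score - 1;
      }
    };

    abstract float toScore(float raw);

    abstract float toRaw(float score);
  }

  public static void main(String[] args) {
    // A user-supplied squared distance of 3 corresponds to a score cut-off of 0.25.
    System.out.println(Similarity.EUCLIDEAN.toScore(3f)); // 0.25
    // The conversion is invertible, so the score can be mapped back to the raw value.
    System.out.println(Similarity.EUCLIDEAN.toRaw(0.25f)); // 3.0
  }
}
```

Note the direction change for the Euclidean case: an upper bound on distance becomes a lower bound on score, so a conversion method would need to document which way the threshold is interpreted.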




[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1355719097

   > It seems there are conflicts due to a recent refactor of this query - would you mind merging from main and resolving those please?
   
   Done



[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318924833

> how common is this use-case? This change is fairly invasive... adding method signatures to e.g. LeafReader.

It is difficult for me to judge in general, but I run into such tasks quite often. Here is the start of the discussion about this functionality: https://lists.apache.org/list?dev@lucene.apache.org:lte=1M:HNSW%20search%20with%20threshold.

The typical case: suppose we have a recommendation system. We have a huge collection of items and we want to recommend to each user the items that would suit him/her. Ranking models that provide high quality can be quite complex and resource-intensive, so we build several layers of models. The most complex ranking model is the last layer. Each earlier layer is simpler than the one after it and selects candidates for the next layer. If we have good embeddings for the items, we can build the first layer as follows: calculate the similarity between an embedding of the user and the embeddings of the items, and compare each similarity value with a threshold. If the similarity exceeds the threshold, we consider that item a candidate for the next layer. This approach can be very productive in practice. The problem is its cost, because we have to calculate the cosine between the user's embedding and the embeddings of all items.

I think the proposed functionality would help with this kind of task.
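
As a rough illustration of the first-layer selection described above, here is a brute-force sketch that scans every item. The class and method names are hypothetical, and the linear scan is exactly the cost that an index-based threshold search is meant to avoid.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical brute-force candidate selection: one cosine per item, O(items * dims). */
final class BruteForceCandidates {
  private BruteForceCandidates() {}

  // Cosine similarity of two equal-length vectors; yields NaN if either vector is all zeros.
  static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / Math.sqrt(normA * normB);
  }

  // Returns the indexes of all items whose cosine with the user embedding exceeds the threshold.
  static List<Integer> select(float[] user, float[][] items, double threshold) {
    List<Integer> candidates = new ArrayList<>();
    for (int id = 0; id < items.length; id++) {
      if (cosine(user, items[id]) > threshold) { // a NaN similarity never passes
        candidates.add(id);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    float[] user = {1f, 0f};
    float[][] items = {{1f, 0f}, {0f, 1f}, {1f, 1f}};
    System.out.println(select(user, items, 0.5));
  }
}
```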


[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318942412

   > i'm also concerned about committing to providing this API for the future. eventually, we'll move away from HNSW to something that actually scales, and it may not support this thresholding?
   
   It is a very good point, thanks! But I can't think of a popular ANN algorithm in which it would be impossible to support this functionality. Still, given that concern, it may be worth moving this functionality into another class rather than KnnVectorQuery... But I'm not sure. It seems that it could make the API too complicated.



[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1320508166

> Can you explain why you want the "find all docs with score > T"?

For example, we want to give the user only the documents that suit him/her. We have a custom scorer (based on an ML model, for example) which calculates a score. Next, we compare that score with the threshold to determine whether the document is suitable for the user or not. But usually that scorer is too computationally expensive to run on every document that passed the filters. To deal with this problem we can build another, much simpler model that selects candidates for the heavy model. One of the basic approaches to building that light model is kNN: we have a vector (embedding) for the user or the user's query, and a vector (embedding) for every document. So we just find the nearest documents and pass them to the heavy scorer. But we don't know K in that case; we know only the threshold. This threshold is defined during the development of the ranking model. Such tasks naturally arise in recommendation systems and in ranking as well.

> That is going to be a scary thing. What if someone asks for T==0? Then the computation and memory requirements are unbounded.

The same result can be achieved by setting K = 1000...00. I don't think we add a new vulnerability here. Maybe it is worth adding a warning to the documentation (for K and for similarityThreshold).

If you still think that it's a bad idea to support such functionality in Lucene, I will rewrite this PR to use post-filtering. But I think it can be useful for people who add ML ranking to search systems based on Lucene.


[GitHub] [lucene] rmuir commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1318777402

   i'm also concerned about committing to providing this API for the future. eventually, we'll move away from HNSW to something that actually scales, and it may not support this thresholding?



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1040969450


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -37,6 +37,7 @@
  * @param <T> the type of query vector
  */
 public class HnswGraphSearcher<T> {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   Any research to indicate why this number was chosen? It seems silly that if a user provides `k = 10_001` it would have a queue bigger than `k = Integer.MAX_VALUE`.
   
   Technically, the max value here should be something like `ArrayUtil.MAX_ARRAY_LENGTH`, but that would eagerly allocate a `new long[heapSize];`, which is VERY costly.
   
   I would prefer a number with some significant reason behind it or some better way of queueing neighbors.
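
One possible "better way of queueing", sketched under assumptions: a heap that starts small and doubles its backing array on demand, so an unbounded search does not pay for a large `long[]` up front. This is a hypothetical stand-alone class, not Lucene's heap and not what the PR implements. It ignores the overflow handling a production version would need near the maximum array length.

```java
import java.util.Arrays;

/** Hypothetical growable min-heap of longs; the backing array doubles only when it is full. */
final class GrowableLongMinHeap {
  private long[] heap;
  private int size;

  GrowableLongMinHeap(int initialCapacity) {
    heap = new long[Math.max(1, initialCapacity)];
  }

  void push(long value) {
    if (size == heap.length) {
      heap = Arrays.copyOf(heap, heap.length * 2); // grow lazily instead of eagerly
    }
    int i = size++;
    while (i > 0) { // sift up: move larger parents down until the slot for value is found
      int parent = (i - 1) >>> 1;
      if (heap[parent] <= value) break;
      heap[i] = heap[parent];
      i = parent;
    }
    heap[i] = value;
  }

  long pop() {
    if (size == 0) throw new IllegalStateException("heap is empty");
    long top = heap[0];
    long last = heap[--size];
    int i = 0;
    while (true) { // sift down: promote the smaller child until last fits
      int child = 2 * i + 1;
      if (child >= size) break;
      if (child + 1 < size && heap[child + 1] < heap[child]) child++;
      if (last <= heap[child]) break;
      heap[i] = heap[child];
      i = child;
    }
    heap[i] = last;
    return top;
  }

  int size() {
    return size;
  }

  int capacity() {
    return heap.length;
  }

  public static void main(String[] args) {
    GrowableLongMinHeap h = new GrowableLongMinHeap(2);
    h.push(5);
    h.push(1);
    h.push(3);
    System.out.println(h.pop() + " " + h.pop() + " " + h.pop() + " capacity=" + h.capacity());
  }
}
```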



##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -235,7 +312,7 @@ private NeighborQueue searchLevel(
     while (candidates.size() > 0 && results.incomplete() == false) {
       // get the best candidate (closest or best scoring)
       float topCandidateSimilarity = candidates.topScore();
-      if (topCandidateSimilarity < minAcceptedSimilarity) {
+      if (topCandidateSimilarity < minAcceptedSimilarity && results.size() >= topK) {
         break;
       }

Review Comment:
   I am not sure about this. This stops gathering results once it's filled, which defeats the purpose of exploring the graph.
   
   Have you seen how this affects recall?



##########
lucene/core/src/java/org/apache/lucene/index/LeafReader.java:
##########
@@ -232,8 +232,48 @@ public final PostingsEnum postings(Term term) throws IOException {
    * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
    * @lucene.experimental
    */
+  public final TopDocs searchNearestVectors(
+      String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+    return searchNearestVectors(
+        field, target, k, Float.NEGATIVE_INFINITY, acceptDocs, visitedLimit);
+  }
+
+  /**
+   * Return the k nearest neighbor documents as determined by comparison of their vector values for
+   * this field, to the given vector, by the field's similarity function. The score of each document
+   * is derived from the vector similarity in a way that ensures scores are positive and that a
+   * larger score corresponds to a higher ranking.
+   *
+   * <p>The search is allowed to be approximate, meaning the results are not guaranteed to be the
+   * true k closest neighbors. For large values of k (for example when k is close to the total
+   * number of documents), the search may also retrieve fewer than k documents.
+   *
+   * <p>The returned {@link TopDocs} will contain a {@link ScoreDoc} for each nearest neighbor,
+   * sorted in order of their similarity to the query vector (decreasing scores). The {@link
+   * TotalHits} contains the number of documents visited during the search. If the search stopped
+   * early because it hit {@code visitedLimit}, it is indicated through the relation {@code
+   * TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO}.
+   *
+   * @param field the vector field to search
+   * @param target the vector-valued query
+   * @param k the number of docs to return (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Would it be possible for this threshold to be an actual distance? My concern here is that for things like `byteVectors`, dot-product scores are insanely small (I think this is a design flaw in itself) and may be confusing to users who want a given "radius" but instead have to figure out a score related to their radius. 
   
   IF we do provide some filtering on a threshold within the search, it would be prudent for that threshold to reflect vector distance directly.




[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1351747886

   It seems there are conflicts due to a recent refactor of this query - would you mind merging from main and resolving those please?



[GitHub] [lucene] rmuir commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

rmuir commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1050228933


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Ben, for the normal scoring, you can look at the tests for the similarities package. None of these have any 0 to 1 range or anything like that. Instead, the requirements are that the score increases semimonotonically as term frequency increases, decreases as document length increases, etc. These guarantees allow optimizations such as block-max WAND to be applied safely. But there's no defined range at all. Instead there are lots of crazy floating-point hacks so that we can safely get really good performance.




[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1049904819


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   Tl;dr
   
   Thank you for bearing with me! I think this is a good change.
   
   I would be happy with the JavaDocs, etc. clearly indicating that this threshold relates to the un-boosted vector score, not the raw similarity calculation. Dot-product, cosine, and euclidean are well-defined concepts outside of Lucene. Lucene mangles (for undoubtedly good reasons) the output of these similarities in undocumented ways to fit within boundaries.
   
   > with the current CR,
   
   I don't know what `CR` means. Change request?
   
   > However, this is similar to how result scores are treated elsewhere in Lucene - their value ranges are not well-defined;
   
   Agreed, ranges are usually predicated on term statistics, etc. and can potentially be considered "unbounded" as the corpus changes. 
   
   However, does Lucene require that all unboosted BM25 scores are between 0-1? It does seem like an "arbitrary" decision (to me, I don't know the full-breadth of Lucene optimizations, etc. when it comes to scores) to restrict vector similarity in this way. But that is a broader conversation. I have some learning to do.
   
   >  I guess practically speaking, as a user, I think I am going to have to do empirical work to know what threshold to use; these are not likely going to be motivated by some a priori knowledge of what a "good" dot-product is
   
   I would argue that a user could have a priori knowledge here. Think of the use case where the user knows the model used to make the vectors. At that point, they know 100% what is considered relevant based on their loss function and their training + test data. They could choose a dot-product or cosine threshold that fits within, say, the 90th percentile of their test data results.
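
One possible reading of "choose a threshold from test data", as a sketch only: given the raw similarities of known-relevant query/document pairs, pick the largest threshold that still lets a chosen fraction of them through. The class name, method name, and the `coverage` parameter are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical helper: derive a similarity threshold from labelled test data. */
final class ThresholdFromTestData {
  private ThresholdFromTestData() {}

  /**
   * Returns the largest threshold t such that at least {@code coverage} (in (0, 1]) of the
   * given similarities are >= t. The input holds similarities of known-relevant pairs.
   */
  static float thresholdCovering(float[] relevantSimilarities, double coverage) {
    if (relevantSimilarities.length == 0) {
      throw new IllegalArgumentException("need at least one similarity");
    }
    if (!(coverage > 0 && coverage <= 1)) {
      throw new IllegalArgumentException("coverage must be in (0, 1]");
    }
    float[] sorted = relevantSimilarities.clone();
    Arrays.sort(sorted); // ascending
    int mustPass = (int) Math.ceil(coverage * sorted.length); // how many values must pass
    return sorted[sorted.length - mustPass];
  }

  public static void main(String[] args) {
    float[] sims = {0.91f, 0.42f, 0.77f, 0.85f, 0.63f};
    System.out.println(thresholdCovering(sims, 0.8)); // keeps at least 80% of relevant pairs
  }
}
```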
   
   I agree that this would be different if users were using an "off the shelf" model. In that case, they would probably require hybrid-search and combining with BM25 to get anything like relevant results (boosting various queries accordingly). Thus, learning what settings are required in an unfiltered case.
   
   > if we were to switch to using vector similarities that would correspond more directly to the underlying functions, we would have to clearly define them
   
   Cosine, dot-product, and euclidean are all already well defined. The functions to calculate them are universally recognized. Where Lucene separates itself is the manipulation of the similarity output to fit into a range [0, 1]. I guess this is the cost of doing business in Lucene.
   
   I am not suggesting that all scoring of vector document searches changes. Simply that "similarity" and "score" are related, but are different things. 
   




[GitHub] [lucene] agorlenko commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1346767509

   I've rewritten this PR with the post-filtering approach, sorry for the delay.



[GitHub] [lucene] benwtrent commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

benwtrent commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1048763428


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -76,12 +91,29 @@ public KnnVectorQuery(String field, float[] target, int k) {
    * @throws IllegalArgumentException if <code>k</code> is less than 1
    */
   public KnnVectorQuery(String field, float[] target, int k, Query filter) {
+    this(field, target, k, Float.NEGATIVE_INFINITY, filter);
+  }
+
+  /**
+   * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
+   * given field. <code>target</code> vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find (the upper bound)
+   * @param similarityThreshold the minimum acceptable value of similarity

Review Comment:
   So, right now, `similarityThreshold` is really `scoreThreshold`. I would prefer it to be the actual similarity of the vectors and NOT how we translate it to scores (which for ByteVectors has some surprising behavior). 
   
   @msokolov What do you think here? 
   
   It's already tricky that "similarity implies score". But truly, similarity != score.




[GitHub] [lucene] agorlenko commented on a diff in pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

agorlenko commented on code in PR #11946:
URL: https://github.com/apache/lucene/pull/11946#discussion_r1041584638


##########
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java:
##########
@@ -37,6 +37,7 @@
  * @param <T> the type of query vector
  */
 public class HnswGraphSearcher<T> {
+  private final int UNBOUNDED_QUEUE_INIT_SIZE = 10_000;

Review Comment:
   I wanted to set a fairly large initial heap size in order to reduce the number of times the heap has to grow. But it seems that post-filtering would be better: https://github.com/apache/lucene/pull/11946#issuecomment-1340151458 
   
   In this case we don't have to modify `HnswGraphSearcher` at all.




[GitHub] [lucene] msokolov commented on pull request #11946: add similarity threshold for hnsw

Posted by GitBox <gi...@apache.org>.

msokolov commented on PR #11946:
URL: https://github.com/apache/lucene/pull/11946#issuecomment-1329282101

Hi, I was taking time off for a few days, back now. Have you tried post-filtering? When we added support for the existing pre-filter (accepting a Query) there was some extensive testing to determine when it is better to pre-filter vs post-filter. The answer is not always so clear-cut. If the filter is not so restrictive (matches > 90% of docs, say), you are probably better off post-filtering. If it is highly restrictive then prefiltering will likely offer performance gains. If it's possible in your application to precompute the filter and cache it for some time (e.g. in a user session), then you can use the existing prefiltering operation by creating a BitSet matching docs that meet the threshold criterion.

So I would suggest trying a large K and post-filtering and see if you get reasonable results?

In short, I think this is too risky/trappy for most users. Using a highly-restrictive scoring threshold is really not the same as using a large K from a user perspective since the cost is predictable with K (not very data dependent), but not so with the score (as a user I don't know what the score distribution is, a priori), so providing a score threshold is definitely more dangerous/trappy.
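
A minimal sketch of the "large K, then post-filter" suggestion, assuming the hits arrive sorted by decreasing score, as kNN results are. It uses plain arrays rather than Lucene's result types, and the class and method names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical post-filter: cut a score-sorted result list at the first hit below the threshold. */
final class PostFilterSketch {
  private PostFilterSketch() {}

  /**
   * @param docs doc ids, parallel to {@code scores}
   * @param scores scores in decreasing order
   * @param minScore minimum acceptable score (inclusive)
   * @return the leading doc ids whose score is at least {@code minScore}
   */
  static int[] keepAtOrAbove(int[] docs, float[] scores, float minScore) {
    int kept = 0;
    // Because scores are sorted, everything after the first miss is also a miss.
    while (kept < scores.length && scores[kept] >= minScore) {
      kept++;
    }
    return Arrays.copyOf(docs, kept);
  }

  public static void main(String[] args) {
    int[] docs = {7, 3, 9};
    float[] scores = {0.9f, 0.6f, 0.2f};
    System.out.println(Arrays.toString(keepAtOrAbove(docs, scores, 0.5f)));
  }
}
```

The cost stays bounded by K regardless of the score distribution. The trade-off is that hits beyond the K-th are never seen, even if they would pass the threshold.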
