You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/31 06:40:01 UTC

[GitHub] [lucene] kaivalnp opened a new pull request, #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

kaivalnp opened a new pull request, #932:
URL: https://github.com/apache/lucene/pull/932

   ### Description
   Link to [Jira](https://issues.apache.org/jira/browse/LUCENE-10559)
   
   ### Solution
   
   Added a `prefilter` and `filterSelectivity` argument to KnnGraphTester to be able to compare pre and post-filtering benchmarks
   
   `filterSelectivity` expresses the selectivity of a filter as proportion of passing docs that are randomly selected. We store these in a FixedBitSet and use this to calculate true KNN as well as in HNSW search
   
   In case of post-filter, we over-select results as `topK / filterSelectivity` to get final hits close to actual requested `topK`
   For pre-filter, we wrap the FixedBitSet in a query and pass it as prefilter argument to KnnVectorQuery


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r888247099


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   I am surprised that iterating and copying the `BitSet` is so expensive -- would you up for sharing some numbers with and without the copying? It might give ideas for optimizations.
   
   Depending on what exactly you're trying to test, I guess we could go back to a strategy where you call `LeafReader#searchNearestVectors` directly, with the filter docs folded into `acceptDocs`? I have the same intuition as @kaivalnp that it feels funny to be benchmarking a workflow that we don't actually follow in a production setting.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r912381221


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -480,11 +537,15 @@ private int[][] getNN(Path docPath, Path queryPath) throws IOException {
     String hash = Integer.toString(Objects.hash(docPath, queryPath, numDocs, numIters, topK), 36);
     String nnFileName = "nn-" + hash + ".bin";
     Path nnPath = Paths.get(nnFileName);
-    if (Files.exists(nnPath) && isNewer(nnPath, docPath, queryPath)) {
+    if (Files.exists(nnPath)

Review Comment:
   Oops I read the logic incorrectly, thanks for clarifying!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r893116012


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Thank You! I have opened a JIRA [issue](https://issues.apache.org/jira/browse/LUCENE-10606), hope it is okay
   Please feel free to suggest edits / alternate approaches / provide feedback



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r888237781


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   I am surprised that iterating and copying the `BitSet` is so expensive -- would you up for sharing some numbers with and without the copying? It might give ideas for optimizations.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r933082450


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   The latest commit incorporates the suggestions. As for the refactor, should it be a separate issue?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r890545274


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Thank you @kaivalnp , these results are insightful. I like the direction of the last change "Modifying Collection", it's clearly valuable to be able to access the cached or precomputed `BitSet` directly. 
   
   Stepping back, I am not clear on we are hoping to benchmark with these changes to `KnnGraphTester`. There is indeed overhead from copying the `BitSet`, but this is just how the query works today (until we fix it!) I'm not sure we should add a special test-only code path so that `KnnGraphTester` can work around this limitation. 
   
   What would you think of this plan?
   * Spin off a separate issue around removing overhead from copying `BitSet` when the query is cached or precomputed. Maybe we'll end up with something similar to your change where we access the iterator directly.
   * Either hold off on this PR until that overhead is addressed, or merge it but without a special workaround to prevent copying. To unblock any testing you could fork `KnnGraphTester` locally or `KnnVectorQuery` to add a workaround?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r892189591


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Thank you @jtibshirani @msokolov for your input!
   
   I agree that we shouldn't modify core classes for tests, and we can create a separate issue for possible collection optimizations
   
   However I feel that this test is important as well, for deciding which query would be suitable for pre-filtering (by comparing with post-filtering + over selection time) and we should address this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r891169857


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Agreed, the approach of using `Bitset.of` to delegate creation of / access to a bitset of matching docs for the filter Query seems very promising! It would enable us to maintain the "user friendly" Query-based API while accessing underlying Bits when they are available.
   
   I also agree it makes sense to work on that change indepedently / in advance of the tester changes, although it's a little paradoxical since I doubt we would have been able to find this without having worked on the testing side first! Still, to land these changes on lucene/main, I think putting in the support for BitSet access can come first. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r908869810


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   > What would you think of this plan?
   > 
   > * Spin off a separate issue around removing overhead from copying `BitSet` when the query is cached or precomputed. Maybe we'll end up with something similar to your change where we access the iterator directly.
   > * Either hold off on this PR until that overhead is addressed, or merge it but without a special workaround to prevent copying. To unblock any testing you could fork `KnnGraphTester` locally or `KnnVectorQuery` to add a workaround?
   
   Now that the issue for the overhead is addressed, should we look into this PR again?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r931863515


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;

Review Comment:
   I've reformatted the code to store matching docs in a `FixedBitSet` early on, which can be used to compute true KNN
   A `BitSetQuery` is created to wrap this bitset later, when required
   
   Does this sound good?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r888896850


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   If we use the low-level `searchNearestVectors`, we won't automatically switch to `exactSearch` on reaching the limit of nodes visited (we can still duplicate the search code from KnnVectorQuery to achieve this)
   And possibly some issues addressed [here](https://issues.apache.org/jira/browse/LUCENE-10504) as well?
   
   ___
   
   I tested two other approaches:
   
   Using Reflection: We update the variables forcefully using reflect. It has an advantage of not modifying classes for test purposes, but it makes the test sort of hacky, and might break silently with further changes to `KnnVectorQuery`
   
   selectivity | effective topK | post-filter recall | post-filter time | pre-filter recall | pre-filter time
   -- | -- | -- | -- | -- | --
   0.8 | 125 | 0.966 | 1.57 | 0.976 | 1.62
   0.6 | 166 | 0.961 | 1.94 | 0.981 | 1.97
   0.4 | 250 | 0.958 | 2.68 | 0.986 | 2.64
   0.2 | 500 | 0.961 | 4.78 | 0.992 | 4.51
   0.1 | 1000 | 0.956 | 8.54 | 0.995 | 7.78
   0.01 | 10000 | 0.979 | 58.12 | 1.000 | 9.84
   
   ___
   
   Modifying Collection: Instead of collecting hit-by-hit using a `LeafCollector`, we can break down the search by instantiating a weight, creating scorers, and finally calling `BitSet.of` on it's iterator. If it is backed by a `BitSet`, the collection is optimized (this can be advantageous as `LRUQueryCache` internally uses a `BitSet`, so such iterators will be common)
   Not completely sure of this one, looking for suggestions. Sample [code](https://github.com/apache/lucene/compare/main...kaivalnp:alternate_collection)
   
   selectivity | effective topK | post-filter recall | post-filter time | pre-filter recall | pre-filter time
   -- | -- | -- | -- | -- | --
   0.8 | 125 | 0.961 | 1.56 | 0.976 | 1.64
   0.6 | 166 | 0.960 | 1.93 | 0.980 | 1.98
   0.4 | 250 | 0.961 | 2.68 | 0.987 | 2.68
   0.2 | 500 | 0.960 | 4.76 | 0.991 | 4.55
   0.1 | 1000 | 0.961 | 8.50 | 0.995 | 7.84
   0.01 | 10000 | 0.953 | 58.17 | 1.000 | 9.71



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r888849991


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Here are some related numbers. This is the baseline (with bulk collection):
   
   selectivity | effective topK | post-filter recall | post-filter time | pre-filter recall | pre-filter time
   -- | -- | -- | -- | -- | --
   0.8 | 125 | 0.964 | 1.58 | 0.975 | 1.60
   0.6 | 166 | 0.962 | 1.94 | 0.981 | 1.97
   0.4 | 250 | 0.960 | 2.70 | 0.986 | 2.64
   0.2 | 500 | 0.963 | 4.76 | 0.991 | 4.51
   0.1 | 1000 | 0.957 | 8.53 | 0.995 | 7.78
   0.01 | 10000 | 0.961 | 58.28 | 1.000 | 9.58
   
   I removed the overloaded `BulkScorer` (and made the `#scorer` return a `ConstantScoreScorer` wrapping the `BitSetIterator` of our query, much like the `BitSetQuery` that you mentioned). This would remove the bulk collection optimization (and switch to doc by doc collection). Here are the numbers:
   
   selectivity | effective topK | post-filter recall | post-filter time | pre-filter recall | pre-filter time
   -- | -- | -- | -- | -- | --
   0.8 | 125 | 0.967 | 1.55 | 0.976 | 19.65
   0.6 | 166 | 0.964 | 1.94 | 0.981 | 17.79
   0.4 | 250 | 0.961 | 2.69 | 0.986 | 14.71
   0.2 | 500 | 0.958 | 4.78 | 0.992 | 11.19
   0.1 | 1000 | 0.959 | 8.53 | 0.994 | 11.50
   0.01 | 10000 | 0.937 | 58.32 | 1.000 | 10.34
   
   The prefilter collection time seems to be high when more docs pass (and are collected one-by-one)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r939715248


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Sure, will do! This is the [link](https://github.com/apache/lucene/pull/1059)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r885501069


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +774,79 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    public void createBitSet(int numDocs) {
+      selectedBits = new FixedBitSet(numDocs);
+      if (selectivity == 1f) {
+        selectedBits.set(0, numDocs);
+        cost = numDocs;
+      } else {
+        selectedBits.clear(0, numDocs);
+        cost = 0;
+        for (int i = 0; i < numDocs; i++) {
+          if (Math.random() < selectivity) {
+            selectedBits.set(i);
+            cost++;
+          }
+        }
+      }
+    }
+
+    @Override
+    public String toString(String field) {
+      return "SelectiveQuery[" + selectivity + "]";
+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {
+      visitor.visitLeaf(this);
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+      return obj instanceof SelectiveQuery && obj.hashCode() == this.hashCode();

Review Comment:
   Thanks! Makes sense to cast and add this additional equality check



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r885693494


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   So .. this relies on the fact that `BitSetCollector.doSetNextReader` will have been called prior to this, setting the appropriate `ord`. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r886226757


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +774,79 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    public void createBitSet(int numDocs) {
+      selectedBits = new FixedBitSet(numDocs);
+      if (selectivity == 1f) {
+        selectedBits.set(0, numDocs);
+        cost = numDocs;
+      } else {
+        selectedBits.clear(0, numDocs);
+        cost = 0;
+        for (int i = 0; i < numDocs; i++) {
+          if (Math.random() < selectivity) {
+            selectedBits.set(i);
+            cost++;
+          }
+        }
+      }
+    }
+
+    @Override
+    public String toString(String field) {
+      return "SelectiveQuery[" + selectivity + "]";
+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {
+      visitor.visitLeaf(this);
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+      return obj instanceof SelectiveQuery && obj.hashCode() == this.hashCode();

Review Comment:
   Thanks for updating, this has passed ErrorProne.
   It still fails in forbiddenApis check. I think you can reproduce it by `gradlew check -x test` or just `gradlew :lucene:core:forbiddenApisTest` on your local machine (and fix it).
   https://github.com/apache/lucene/runs/6671995066?check_suite_focus=true#step:6:615



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r887525645


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Could we create a new `Query` type backed by a `BitSet`? For example we have a `BitSetQuery` in `TestScorerPerf`, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify `KnnVectorQuery` just for these test purposes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r932521592


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use `Math.random()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r885435733


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +774,79 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    public void createBitSet(int numDocs) {
+      selectedBits = new FixedBitSet(numDocs);
+      if (selectivity == 1f) {
+        selectedBits.set(0, numDocs);
+        cost = numDocs;
+      } else {
+        selectedBits.clear(0, numDocs);
+        cost = 0;
+        for (int i = 0; i < numDocs; i++) {
+          if (Math.random() < selectivity) {
+            selectedBits.set(i);
+            cost++;
+          }
+        }
+      }
+    }
+
+    @Override
+    public String toString(String field) {
+      return "SelectiveQuery[" + selectivity + "]";
+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {
+      visitor.visitLeaf(this);
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+      return obj instanceof SelectiveQuery && obj.hashCode() == this.hashCode();

Review Comment:
   It looks like ErrorProne complains about this line (cause of the CI failure). 
   https://github.com/apache/lucene/runs/6666473724?check_suite_focus=true#step:6:624
   
   I wonder if `obj.selectivity == this.selectivity` could work instead of checking hashCode().



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r911682445


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;

Review Comment:
   It would be nice if `SelectiveQuery` created the bit set in the constructor, so all its variables could be immutable.



##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Since this is in the tests module, you could use `org.apache.lucene.tests.util.LuceneTestCase.random` to avoid the suppression.



##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -480,11 +537,15 @@ private int[][] getNN(Path docPath, Path queryPath) throws IOException {
     String hash = Integer.toString(Objects.hash(docPath, queryPath, numDocs, numIters, topK), 36);
     String nnFileName = "nn-" + hash + ".bin";
     Path nnPath = Paths.get(nnFileName);
-    if (Files.exists(nnPath) && isNewer(nnPath, docPath, queryPath)) {
+    if (Files.exists(nnPath)

Review Comment:
   Does this mean that we can only compute recall if there is no filter query? Maybe we want to be able to compute recall in this case too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r933450082


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Sure, we also allow just opening GitHub PRs now with no corresponding JIRA issue (if that's more convenient).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani merged pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani merged PR #932:
URL: https://github.com/apache/lucene/pull/932


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mocobeta commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
mocobeta commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r885439874


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +774,79 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    public void createBitSet(int numDocs) {
+      selectedBits = new FixedBitSet(numDocs);
+      if (selectivity == 1f) {
+        selectedBits.set(0, numDocs);
+        cost = numDocs;
+      } else {
+        selectedBits.clear(0, numDocs);
+        cost = 0;
+        for (int i = 0; i < numDocs; i++) {
+          if (Math.random() < selectivity) {
+            selectedBits.set(i);
+            cost++;
+          }
+        }
+      }
+    }
+
+    @Override
+    public String toString(String field) {
+      return "SelectiveQuery[" + selectivity + "]";
+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {
+      visitor.visitLeaf(this);
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+      return obj instanceof SelectiveQuery && obj.hashCode() == this.hashCode();

Review Comment:
   Sorry, of course it won't work; a cast is needed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r892605221


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Sounds good, it definitely makes sense to do both changes. @kaivalnp would you like to open a new JIRA issue focused on using cached/ precomputed filters directly? I'm also happy to file the issue and tag you for credit + awareness.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r931869194


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Since the test is run from a static context (main method), I'm unable to use `LuceneTestCase`
   
   We can:
   - Keep using Math.random and suppress warnings
   - Change the code a bit to test using `run` function instead of `main` (not depend on instantiating `KnnGraphTester`)
   
   The second might be better for reproducibility (ability to run using seed)
   
   Any suggestions on this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r939497872


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   @jtibshirani @msokolov Any suggestions/thoughts on this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r933513046


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   yes, fine to follow up separately



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r908941633


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Sounds good! Feel free to ping me for a review once it's updated.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r911849144


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -480,11 +537,15 @@ private int[][] getNN(Path docPath, Path queryPath) throws IOException {
     String hash = Integer.toString(Objects.hash(docPath, queryPath, numDocs, numIters, topK), 36);
     String nnFileName = "nn-" + hash + ".bin";
     Path nnPath = Paths.get(nnFileName);
-    if (Files.exists(nnPath) && isNewer(nnPath, docPath, queryPath)) {
+    if (Files.exists(nnPath)

Review Comment:
   `readNN` reads the cached topK results (if we ran the test with the same parameters previously). Since our filter is random, we only write and read this cache if it is deterministic (selectivity == 1)
   
   This won't affect computing recall (which will be done in both cases), but we recalculate topK every time for the random filter



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on PR #932:
URL: https://github.com/apache/lucene/pull/932#issuecomment-1197766429

   Sorry for the delay!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r933065705


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -192,6 +207,18 @@ private void run(String... args) throws Exception {
         case "-forceMerge":
           forceMerge = true;
           break;
+        case "-prefilter":
+          prefilter = true;
+          break;
+        case "-filterSelectivity":
+          if (iarg == args.length - 1) {
+            throw new IllegalArgumentException("-filterSelectivity requires a following float");
+          }
+          selectivity = Float.parseFloat(args[++iarg]);

Review Comment:
   Sure, will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r935382898


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   > This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use `Math.random()`.
   
   Hey, I did some follow-up on this, and here's the sample [code](https://github.com/kaivalnp/lucene/tree/refactor_graph_tester)
   
   Some key points:
   - Separate indexing and search operations
   - Allow passing index path instead (we can now test HNSW search on any existing index, even those not created by `KnnGraphTest`. This can be beneficial for testing Knn search time on custom indexes)
   - Reduced redundant arguments (we needed to pass `maxConn`, `beamWidth`, `dim` etc even for searching, which aren't required/should be inferred from index automatically)
   - Ability to pass a `seed` (for reproducibility), `maxSegments` (if one wants a single segment index; this was enforced earlier, can be optional), `knnField` (while searching in custom indexes), `cache` (to save on brute force search time using precomputed results)
   - Considered corner cases where if `selectivity` is high, `topK` results may not be found (current implementation requires topK results as far as I know)
   
   Indexing Params:
   ```
   -Doperation=index
   -Ddocs=(path of vec file containing docs)
   -Ddim=(dimension of doc vectors)
   -DnumDocs=(number of vectors)
   -Dindex=(index path to be created)
   -DknnField=(knn field name in index; optional, defaults to `knn`)
   -DmaxConn=(`maxConn` used for indexing)
   -DbeamWidth=(`beamWidth` used for indexing)
   -DmaxSegments=(max segments desired; optional, defaults to no merges)
   ```
   
   Search Params:
   ```
   -Doperation=search
   -Dindex=(path of index; `dim` will be inferred)
   -DknnField=knn field name in index; optional, defaults to `knn`)
   -Dqueries=(path of vec file containing queries)
   -DnumQueries=(number of queries to run)
   -Dcache=(path to cache; read from cache if found, else compute and write new)
   -DtopK=(desired `topK`)
   -DfilterSelectivity=(selectivity of filter; optional, defaults to 1)
   -Dseed=(seed; optional)
   ```
   
   Yet to be added:
   - Index `info` operation (some segment info?)
   - pre/post filter (currently pre-filters by default, add option for over-selection + post-filter)
   - some output formatting (only basic details now)
   
   Some considerations:
   - Not extended `LuceneTestCase` as being a unit test, it can only access limited resources and write to temporary files. However, incorporated a `seed` argument for reproducibility
   - Shifted to JVM arguments for cleaner code (directly access property, no boilerplate required)
   
   Please let me know if this is in the right direction.. (or if I missed something)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r888237781


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   I am surprised that iterating and copying the `BitSet` is so expensive -- would you up for sharing some numbers with and without the copying? It might give ideas for optimizations.
   
   One idea just for this test: you could create a custom `LeafReader` that folds the filter bit set into the live docs. You can seen example of overriding the live docs in `NoLiveDocsDirectoryReader` from `TestKnnVectorQuery`. This is a bit hacky but doesn't feel too bad as a well-scoped approach for testing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r887637829


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   > So .. this relies on the fact that `BitSetCollector.doSetNextReader` will have been called prior to this, setting the appropriate `ord`. This is very clever, but it relies heavily on knowledge of internal implementation details that aren't really part of the published API, and this will tie us to this BitSetCollector implementation. I wonder if we could create a new KnnVectorQuery constructor that would accept a Function<Integer, BitSetIterator>, rather than a Query, that would be responsible for returning a BitSetIterator for a given ord?
   
   This is interesting and would solve many problems of wrapping the BitSet in a Query (and associated BulkScorer). Given this function, we could add a new constructor to BitSetCollector (to instantiate the internal BitSets from these BitSetIterators instead), and no indexSearcher.search would be required thereafter
   
   My concern is that it creates different search paths and constructors for this function, which will only be used from our test and might complicate the class (Though this addition would be helpful if we want to delegate the task of populating the filtered docs into BitSets outside of KnnVectorQuery)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r933076894


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   We currently use command line arguments to pass the configuration parameters, and hence will be unable to use `RandomizedRunner` (because of `main` having static context)
   
   Using a `RandomizedRunner` will be beneficial, as we can save on the true KNN compute step (as we can reproduce the random setting of bits across runs). The current code does not cache the true KNN results for selective filters (since it is random and won't produce deterministic results)
   
   This true KNN step takes up bulk of the search time for selective searches
   
   We can:
   - Manually add some parameter for the seed, and enable caching of true KNN results
   - Extend `LuceneTestCase` and shift to individual tests (indexing, searching, etc). This might clean up the code, and make it more readable in general, but needs a major refactor in my opinion



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r939674449


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Hi Kaival - would you mind opening a new issue or maybe just a PR where we can discuss? I went looking for this comment and had a little trouble figuring out where it was, I guess because this PR is closed it no longer shows up on the list of open PRs, duh.
   
   As for the specifics of the proposal, your ideas seem good, but (1) I am trying to push another PR for handling low-precision vector quantization that will make a number of changes to the existing KnnGraphTester -sorry! but can you think about also  incorporating those changes? and (2) maybe a smaller refactor exposing a function that accepts all the needed arguments (that can be called from a main that gets its args somehow without changing any of the other implementation details would be a good first step? And then we could have an actual unit test so that we can ensure that other changes are safe. And then we could start restructuring the internals to make it better? We could also change command-line parsing if it seems better to do it some other way, but can we do these as separate PRs please?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r893419386


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +774,79 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    public void createBitSet(int numDocs) {
+      selectedBits = new FixedBitSet(numDocs);
+      if (selectivity == 1f) {
+        selectedBits.set(0, numDocs);
+        cost = numDocs;
+      } else {
+        selectedBits.clear(0, numDocs);
+        cost = 0;
+        for (int i = 0; i < numDocs; i++) {
+          if (Math.random() < selectivity) {
+            selectedBits.set(i);
+            cost++;
+          }
+        }
+      }
+    }
+
+    @Override
+    public String toString(String field) {
+      return "SelectiveQuery[" + selectivity + "]";
+    }
+
+    @Override
+    public void visit(QueryVisitor visitor) {
+      visitor.visitLeaf(this);
+    }
+
+    @Override
+    public boolean equals(Object obj) {
+      return obj instanceof SelectiveQuery && obj.hashCode() == this.hashCode();

Review Comment:
   Thanks! The latest change is passing `./gradlew check` locally now



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r932381359


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Please go ahead and refactor to make it testable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jtibshirani commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
jtibshirani commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r932513152


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -91,6 +101,9 @@ public class KnnGraphTester {
   private int beamWidth;
   private int maxConn;
   private VectorSimilarityFunction similarityFunction;
+  private FixedBitSet matchDocs;
+  private float selectivity;
+  private boolean prefilter;
 
   @SuppressForbidden(reason = "uses Random()")

Review Comment:
   hmm, not related to your PR but maybe we could remove this suppression as a clean-up.



##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -192,6 +207,18 @@ private void run(String... args) throws Exception {
         case "-forceMerge":
           forceMerge = true;
           break;
+        case "-prefilter":
+          prefilter = true;
+          break;
+        case "-filterSelectivity":
+          if (iarg == args.length - 1) {
+            throw new IllegalArgumentException("-filterSelectivity requires a following float");
+          }
+          selectivity = Float.parseFloat(args[++iarg]);

Review Comment:
   Maybe we could also assert that if `prefilter = true`, then `filterSelectivity < 1f`. Because otherwise you could accidentally set `prefilter` without setting a selectivity, which doesn't make much sense.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r932953915


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -91,6 +101,9 @@ public class KnnGraphTester {
   private int beamWidth;
   private int maxConn;
   private VectorSimilarityFunction similarityFunction;
+  private FixedBitSet matchDocs;
+  private float selectivity;
+  private boolean prefilter;
 
   @SuppressForbidden(reason = "uses Random()")

Review Comment:
   Sure, will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r911849373


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   Sure.. Makes sense



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] msokolov commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
msokolov commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r887991586


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   Apparently the BitSet copying is a fairly large cost factor relative to the HNSW search. Also, I do think we would like to eventually have a fully supported (not just for tests) mechanism for this so that we can make more efficient use of cached queries.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

Posted by GitBox <gi...@apache.org>.
kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r887687260


##########
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##########
@@ -225,6 +225,11 @@ public BitSetIterator getIterator(int contextOrd) {
       return new BitSetIterator(bitSets[contextOrd], cost[contextOrd]);
     }
 
+    public void setBitSet(BitSet bitSet, int cost) {
+      bitSets[ord] = bitSet;

Review Comment:
   > Could we create a new `Query` type backed by a `BitSet`? For example we have a `BitSetQuery` in `TestScorerPerf`, maybe we could pull this out as a test class or just duplicate it. That way we wouldn't have to modify `KnnVectorQuery` just for these test purposes.
   
   The current SelectiveQuery class is in essence a Query backed by FixedBitSet. The problem arises during hit collection, which happens doc by doc (so we have to iterate over the entire BitSet, and call collect on set bits), which adds a large arbitrary time. To prevent this copying of our BitSet into BitSetCollector's internal one (bit by bit), I have overloaded the BulkScorer to directly update it's underlying reference
   
   An alternative to adding these modifiers could be using the Reflection package to manually update the variables. This way we won't need to change defined classes for tests (but it might make the test somewhat hacky?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org