You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/08/02 10:22:45 UTC

[GitHub] [lucene] kaivalnp commented on a diff in pull request #932: LUCENE-10559: Add Prefilter Option to KnnGraphTester

kaivalnp commented on code in PR #932:
URL: https://github.com/apache/lucene/pull/932#discussion_r935382898


##########
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##########
@@ -730,4 +794,61 @@ protected int comparePivot(int j) {
       return Float.compare(score[pivot], score[j]);
     }
   }
+
+  private static class SelectiveQuery extends Query {
+
+    public float selectivity = 1f;
+    private FixedBitSet selectedBits;
+    private long cost;
+
+    @SuppressForbidden(reason = "Uses Math.random()")

Review Comment:
   > This seems like a nice follow-up if you're up for it! For this PR, it's no problem with me to just use `Math.random()`.
   
   Hey, I did some follow-up on this, and here's the sample [code](https://github.com/kaivalnp/lucene/tree/refactor_graph_tester)
   
   Some key points:
   - Separate indexing and search operations
   - Allow passing index path instead (we can now test HNSW search on any existing index, even those not created by `KnnGraphTest`. This can be beneficial for testing Knn search time on custom indexes)
   - Reduced redundant arguments (we needed to pass `maxConn`, `beamWidth`, `dim` etc even for searching, which aren't required/should be inferred from index automatically)
   - Ability to pass a `seed` (for reproducibility), `maxSegments` (if one wants a single segment index; this was enforced earlier, can be optional), `knnField` (while searching in custom indexes), `cache` (to save on brute force search time using precomputed results)
   - Considered corner cases where if `selectivity` is high, `topK` results may not be found (current implementation requires topK results as far as I know)
   
   Indexing Params:
   ```
   -Doperation=index
   -Ddocs=(path of vec file containing docs)
   -Ddim=(dimension of doc vectors)
   -DnumDocs=(number of vectors)
   -Dindex=(index path to be created)
   -DknnField=(knn field name in index; optional, defaults to `knn`)
   -DmaxConn=(`maxConn` used for indexing)
   -DbeamWidth=(`beamWidth` used for indexing)
   -DmaxSegments=(max segments desired; optional, defaults to no merges)
   ```
   
   Search Params:
   ```
   -Doperation=search
   -Dindex=(path of index; `dim` will be inferred)
   -DknnField=knn field name in index; optional, defaults to `knn`)
   -Dqueries=(path of vec file containing queries)
   -DnumQueries=(number of queries to run)
   -Dcache=(path to cache; read from cache if found, else compute and write new)
   -DtopK=(desired `topK`)
   -DfilterSelectivity=(selectivity of filter; optional, defaults to 1)
   -Dseed=(seed; optional)
   ```
   
   Yet to be added:
   - Index `info` operation (some segment info?)
   - pre/post filter (currently pre-filters by default, add option for over-selection + post-filter)
   - some output formatting (only basic details now)
   
   Some considerations:
   - Not extended `LuceneTestCase` as being a unit test, it can only access limited resources and write to temporary files. However, incorporated a `seed` argument for reproducibility
   - Shifted to JVM arguments for cleaner code (directly access property, no boilerplate required)
   
   Please let me know if this is in the right direction.. (or if I missed something)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org