Posted to issues@lucene.apache.org by "gsmiller (via GitHub)" <gi...@apache.org> on 2023/05/19 19:55:36 UTC
[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

gsmiller commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555164152

   Here's what I'm seeing so far in benchmarking...
   
   I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because 1) we don't really have benchmark coverage for this in `luceneutil` (I know, we should address this!), and 2) it lets me test a number of specific scenarios. Honestly, it was just the easiest path for me to get some numbers. My benchmark is here: 
   [TiSBench.java.txt](https://github.com/apache/lucene/files/11519960/TiSBench.java.txt). If you're interested in running it, you need a little setup. Most of it is obvious, but I'm happy to answer questions if helpful.
   
   The benchmark indexes geonames data (~12 million records). It includes an ID field and a Country Code field, each indexed with both postings and doc values. The ID field is a primary key (PK). The benchmark tasks break down into:
   1. Term disjunctions over the country code field, with varying cardinality: "all" country codes (254 terms), "medium cardinality" country codes (20 different terms), and "low cardinality" country codes (10 different terms). The medium/low cardinality tasks also break down into a set with high and low costs (i.e., common terms and rare terms).
   2. Term disjunctions over the ID field. There are high/medium/low versions of this task (500 terms, 20 terms, 10 terms)
   
   For each task, there are four runs:
   1. Typical postings-based approach with the current TermInSetQuery that extends MultiTermQuery (using prefix coding of terms and the ping-pong intersection with index terms)
   2. Postings-based approach with a new TermInSetQuery that extends AutomatonQuery (delegating to Terms#intersect to intersect query terms with index terms)
   3. DocValues-based approach with the current TermInSetQuery
   4. DocValues-based approach with the new TermInSetQuery
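
   For readers unfamiliar with the "ping-pong" intersection mentioned in run 1, here is a minimal sketch of the idea. It is not Lucene's implementation: `PingPongSketch` is a hypothetical name, `TreeSet<String>` stands in for both the sorted query terms and the term dictionary, and `ceiling` plays the role of a seek on the index terms. Strings here compare in UTF-16 order, whereas Lucene compares term bytes.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.TreeSet;

   /** Toy model of ping-pong intersection between sorted query terms and a sorted term dictionary. */
   final class PingPongSketch {

     /** Returns the query terms that also exist in the index, in sorted order. */
     static List<String> intersect(TreeSet<String> indexTerms, TreeSet<String> queryTerms) {
       List<String> matches = new ArrayList<>();
       String q = queryTerms.isEmpty() ? null : queryTerms.first();
       while (q != null) {
         // "Ping": seek the index to the smallest term >= the current query term.
         String idx = indexTerms.ceiling(q);
         if (idx == null) {
           break; // no index terms left at or after q
         }
         if (idx.equals(q)) {
           matches.add(q);
           q = queryTerms.higher(q); // advance to the next query term
         } else {
           // "Pong": skip every query term smaller than where the index landed.
           q = queryTerms.ceiling(idx);
         }
       }
       return matches;
     }

     public static void main(String[] args) {
       TreeSet<String> index = new TreeSet<>(List.of("AU", "BR", "CA", "DE", "FR", "US"));
       TreeSet<String> query = new TreeSet<>(List.of("AT", "CA", "CH", "US", "ZZ"));
       System.out.println(intersect(index, query));
     }
   }
   ```

   Each side only ever moves forward, so runs of non-matching terms on either side are skipped with a single seek rather than visited one by one.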
   
   I ran postings- and docvalues-based approaches since the term dictionary implementations are different.
   
   In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler and I _think_ we're just seeing the overhead of building the automaton. I think this overhead shows up because PK query processing is so cheap in general, whereas the other tasks do enough work to "hide" it.
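
   To illustrate why up-front construction can matter for cheap queries, here is a rough sketch that uses a plain trie as a stand-in for the automaton. This is an assumption for illustration only: `TermSetTrieSketch` is a hypothetical class, and the structure Lucene actually builds for an AutomatonQuery is a different (byte-based, compiled) automaton. The point is just that every character of every query term is touched before the first index term is examined.

   ```java
   import java.util.List;
   import java.util.TreeMap;

   /** Toy stand-in for an automaton accepting a fixed set of terms (a plain trie, not minimized). */
   final class TermSetTrieSketch {
     private final TreeMap<Character, TermSetTrieSketch> next = new TreeMap<>();
     private boolean accept;

     /** Build cost is proportional to the total length of all terms, paid once per query. */
     static TermSetTrieSketch build(List<String> terms) {
       TermSetTrieSketch root = new TermSetTrieSketch();
       for (String term : terms) {
         TermSetTrieSketch node = root;
         for (int i = 0; i < term.length(); i++) {
           node = node.next.computeIfAbsent(term.charAt(i), c -> new TermSetTrieSketch());
         }
         node.accept = true;
       }
       return root;
     }

     /** Returns true only if the whole term ends on an accepting state. */
     boolean accepts(String term) {
       TermSetTrieSketch node = this;
       for (int i = 0; i < term.length() && node != null; i++) {
         node = node.next.get(term.charAt(i));
       }
       return node != null && node.accept;
     }

     /** Number of states reachable from this node, including itself. */
     int stateCount() {
       int count = 1;
       for (TermSetTrieSketch child : next.values()) {
         count += child.stateCount();
       }
       return count;
     }

     public static void main(String[] args) {
       TermSetTrieSketch trie = build(List.of("10", "12", "20"));
       System.out.println(trie.stateCount() + " states; accepts(\"12\") = " + trie.accepts("12"));
     }
   }
   ```

   With a few hundred primary-key terms, a build step like this is a fixed cost per query, which would be more visible when the rest of the query takes well under a millisecond.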
   
   So... as of now, I don't see any performance benefits to moving to this approach, and maybe see some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (bloom filter implementation could have a specific intersection implementation that leverages the bloom filter).
   
   I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set.
   
   Here are the benchmark results (numbers are query time in ms):
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | All Country Code Filter Terms | 507.47 | 506.98 | 82.43 | 104.75 |
   | Medium Cardinality + High Cost Country Code Filter Terms | 328.24 | 328.16 | 167.88 | 167.87 |
   | Low Cardinality + High Cost Country Code Filter Terms | 70.85 | 70.93 | 169.59 | 169.71 |
   | Medium Cardinality + Low Cost Country Code Filter Terms | 0.31 | 0.27 | 73.97 | 73.99 |
   | Low Cardinality + Low Cost Country Code Filter Terms | 0.42 | 0.41 | 73.85 | 73.99 |
   | High Cardinality PK Filter Terms | 6.34 | 8.57 | 127.34 | 129.63 |
   | Medium Cardinality PK Filter Terms | 0.46 | 0.64 | 110.12 | 110.51 |
   | Low Cardinality PK Filter Terms | 0.19 | 0.31 | 66.41 | 66.51 |
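
   For a sense of scale, the postings-based PK rows work out to the automaton version being roughly 35%, 39%, and 63% slower (high, medium, and low cardinality respectively), though the absolute differences are small for the medium and low cases. The snippet below is just that arithmetic, with the timings copied from the table; `RelativeChangeSketch` is a hypothetical helper and not part of the benchmark.

   ```java
   /** Relative change between two timings, as a percentage of the baseline. */
   final class RelativeChangeSketch {

     static double percentChange(double baselineMs, double candidateMs) {
       return (candidateMs - baselineMs) / baselineMs * 100.0;
     }

     public static void main(String[] args) {
       // {Postings MTQ, Postings Automata} in ms, copied from the PK rows of the results table.
       double[][] pkRows = {{6.34, 8.57}, {0.46, 0.64}, {0.19, 0.31}};
       String[] labels = {"High", "Medium", "Low"};
       for (int i = 0; i < pkRows.length; i++) {
         double change = percentChange(pkRows[i][0], pkRows[i][1]);
         System.out.printf("%s cardinality PK: %+.1f%%%n", labels[i], change);
       }
     }
   }
   ```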


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

