You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "gsmiller (via GitHub)" <gi...@apache.org> on 2023/05/18 23:42:47 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

gsmiller opened a new pull request, #12312:
URL: https://github.com/apache/lucene/pull/12312

   ### Description
   
   I started experimenting with #12176 to see if we can get any benefits out of having `TermInSetQuery` extend `AutomatonQuery` instead of `MultiTermQuery`. I'm opening this only as a draft for now so that I can "show my work" alongside some benchmark results as I have them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a diff in pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.

rmuir commented on code in PR #12312:
URL: https://github.com/apache/lucene/pull/12312#discussion_r1199137356


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +290,83 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void add(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.

Review Comment:
   hmm, this won't work if someone has binary terms that aren't UTF-8 encoded. I think we shouldn't be building an int32 automaton since we are passing `binary=true` to AutomatonQuery constructor.
   
   Instead we just want to build a binary automaton, where transitions are just bytes. 
   
   PrefixQuery is an example of an automatonquery that supports binary terms and builds a binary automaton directly in this way.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555337627

   Hmm... not sure if I've got something setup incorrectly with my JFR settings, but trying to dig into the other tasks, I can't even get the relevant methods to show up in the profiled calls. I attached a debugger and made sure the call flow is what I expect, but I'm not seeing any of the term intersection logic in what's getting profiled. My theory is that the term intersection is so trivial relative to the actually work of computing the postings disjunction that it's just not showing up in the samples, but maybe there's a setting I'm missing that's not sampling in a fine-enough grain? Not sure. But as far as I can tell, with the non-PK tasks, it looks like the term intersection may just be such a trivial part of the query cost that it doesn't matter what approach we take.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.

rmuir commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1554807614

   thanks for getting this started! Will be interested to see how the use of `Terms.intersect` impacts the performance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555164152

   Here's what I'm seeing so far in benchmarking...
   
   I took a custom benchmarking approach for this, similar to #12151 and other related issues. I did this because, 1) we don't really have benchmark coverage in `luceneutil` (I know, we should address this!), and 2) it lets me test a number of specific scenarios. Honestly, it was just the easiest path for me to get some numbers. My benchmarked is here: 
   [TiSBench.java.txt](https://github.com/apache/lucene/files/11519960/TiSBench.java.txt). If you're interested in running it, you need a little setup. Most of it is obvious, but I'm happy to answer questions if helpful.
   
   The benchmark indexes geonames data (~12MM records). It includes an ID field and a Country Code field (both postings and doc values for each). The ID field is a primary key. The benchmark tasks break down into:
   1. Term disjunctions over the country code field, with varying cardinality: "all" country codes (254 terms), "medium cardinality" country codes (20 different terms), and "low cardinality" country codes (10 different terms). The medium/low cardinality tasks also break down into a set with high and low costs (i.e., common terms and rare terms).
   2. Term disjunctions over the ID field. There are high/medium/low versions of this task (500 terms, 20 terms, 10 terms)
   
   For each task, there are four runs:
   1. Typical postings-based approach with the current TermInSetQuery that extends MultiTermQuery (using prefix coding of terms and the ping-pong intersection with index terms)
   2. Postings-based approach with a new TermInSetQuery that extends AutomatonQuery (delegating to Term#intersect to intersect query terms with index terms)
   3. DocValues-based approach with the current TermInSetQuery
   4. DocValues-based approach with the new TermInSetQuery
   
   I ran postings- and docvalues-based approaches since the term dictionary implementations are different.
   
   In general, the two approaches demonstrate similar latency characteristics, but the automaton approach is a bit worse on the PK field. I dug in a bit with a profiler and I _think_ we're just seeing the overhead of building the automaton. I think this overhead is showing up because the PK query processing is so cheap in general vs. the other tasks that can "hide" the overhead.
   
   So... as of now, I don't see any performance benefits to moving to this approach, and maybe see some regressions. On the other hand, it would be nice to move to this implementation so we could have codec-dependent intersection techniques, which would help address issues like #12280 (bloom filter implementation could have a specific intersection implementation that leverages the bloom filter).
   
   I'll try to run some benchmarks on our Amazon product search application next week just to gather some additional data points. Maybe we can learn more about how this technique might behave on another benchmark data set.
   
   Here are the benchmark results (numbers are query time in ms):
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | All Country Code Filter Terms | 507.47 | 506.98 | 82.43 | 104.75 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Medium Cardinality + High Cost Country Code Filter Terms | 328.24 | 328.16 | 167.88 | 167.87 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Low Cardinality + High Cost Country Code Filter Terms | 70.85 | 70.93 | 169.59 | 169.71 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Medium Cardinality + Low Cost Country Code Filter Terms | 0.31 | 0.27 | 73.97 | 73.99 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Low Cardinality + Low Cost Country Code Filter Terms | 0.42 | 0.41 | 73.85 | 73.99 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | High Cardinality PK Filter Terms | 6.34 | 8.57 | 127.34 | 129.63 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Medium Cardinality PK Filter Terms | 0.46 | 0.64 | 110.12 | 110.51 |
   | Task | Postings MTQ | Postings Automata | DV MTQ | DV Automata |
   |---|---|---|---|---|
   | Low Cardinality PK Filter Terms | 0.19 | 0.31 | 66.41 | 66.51 |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a diff in pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.

rmuir commented on code in PR #12312:
URL: https://github.com/apache/lucene/pull/12312#discussion_r1199139773


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +290,83 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void add(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.

Review Comment:
   nevermind, sorry for the noise. i read the diff wrong and got the char/binary mixed up. i think you are doing it right



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.

rmuir commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555172675

   hmm, disappointing. Was hoping to see gains on the terms dictionary since it optimizes `intersect`. wonder what is going on.
   
   Of course docvalues impl doesn't optimize `intersect` in any way, so it isn't surprising that we don't see gains there, but it is something we could improve later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12312: [DRAFT] GH#12176: TermInSetQuery extends AutomatonQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12312:
URL: https://github.com/apache/lucene/pull/12312#issuecomment-1555244204

   OK, here's a method profiler diff for the "High Cardinality PK" task, comparing two postings approaches—one that's using the current MultiTermQuery version, and one using AutomatonQuery ("LegacyTermInSetQuery" is our current version extending MultiTermQuery). The green colored frames show where the AutomatonQuery is spending less time compared to the MultiTermQuery version, and red is the opposite. Eyeballing this, it seems to confirm that it's a little more expensive to build the automaton than to do the prefix-coding, but a little less expensive to do the term intersection with the automaton approach (down in the codec's optimized intersect). The net/net though is that the AutomatonQuery approach in this case is slower by ~35%.
   <img width="1733" alt="Screen Shot 2023-05-19 at 1 55 36 PM" src="https://github.com/apache/lucene/assets/16479560/c251415c-877a-4c5e-b8a1-ecb5c8546282">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org