You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2023/01/17 19:33:14 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

gsmiller opened a new pull request, #12089:
URL: https://github.com/apache/lucene/pull/12089

   ### Description
   
   This is a DRAFT PR to sketch out the idea of a "self optimizing" TermInSetQuery. The idea is to build on the new `KeywordField` being proposed in #12054, which indexes both postings and DV data. It takes a bit of a different approach though as compared to `IndexOrDocValuesQuery` by "internally" deciding whether to use postings vs. doc values (at the segment granularity).
   
   Please note that there are many TODOs in here and I haven't done any benchmarking, etc. I've written light tests to convince myself it works (I've made sure all branches have been exercised), but it's highly likely there are bugs.
   
   I'm putting this out there for discussion only. My plan is to benchmark this approach as a next step, but I wanted to float the idea early to see if anyone has feedback or other ideas. Also, if someone loves the idea and wants to run with it, please go for it. I'm pretty busy for the next couple of week and I'm not sure when I'll come back to this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419161514

   @rmuir I grabbed your patch for adding a `ScoreSupplier` to `DocValuesTermsQuery` (#12129) and reran benchmarks. The gap between IndexOrDV and the "self-optimizing" TermInSetQuery have closed with this change. It looks like I was wrong about the way IndexOrDV plans PK-type queries. I thought it was choosing to use doc values based on what I saw in profiler output, but what I was really seeing was the up-front ordinal lookups in `DocValuesTermsQuery` as a result of not having the `ScoreSupplier` abstraction. With your patch, that goes away.
   
   The only gap that remains now is when the field is _not_ a PK-style field but the terms being used in the disjunction have a low aggregate cost (relative to the other terms in the field; e.g., `Medium Cardinality + Low Cost Country Code Filter Terms`). In this case, IndexOrDV is always using doc values (due to the field-level stats used for cost), but—by doing some term-seeking—we could better decide to use postings. 
   
   Here are updated benchmark results: [TiSBenchResults_Simplified_DVSSPatch.md.txt](https://github.com/apache/lucene/files/10663766/TiSBenchResults_Simplified_DVSSPatch.md.txt)
   (Note that "low cardinality" cases are kind of terrible still because the TiSQuery is being rewritten to a BooleanQuery)
   
   > to me the issue is a problem with TermInSetQuery ScorerSupplier cost method
   
   +1. Maybe there's a way to address this remaining gap by being smarter about the cost function without term-seeking? That would be ideal.
   
   I also played around with the idea of a "cost iterator" abstraction on `ScoreSupplier` as a way to allow something like `TermInSetQuery` to provide incremental costs to `IndexOrDocValuesQuery` as it term-seeks. This feels clunky to me, and I'm not proposing it as a "good idea" right now, but I'll share it as another approach. I was able to get comparable benchmark results with this technique, and it still allows `IndexOrDocValuesQuery` to "own" the decision between postings and doc values: https://github.com/apache/lucene/compare/main...gsmiller:lucene:explore/tis-score-supplier-cost-iterator. Benchmark results for this approach are here: [TiSBenchResults_SSIterator.md.txt](https://github.com/apache/lucene/files/10663947/TiSBenchResults_SSIterator.md.txt). It feels overly complicated though.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407401560

   also personally i wouldn't target the sandbox at all when attacking this problem. such queries went into the sandbox before, primarily i think because they were very tricky to use. There wasn't such a thing as IndexOrDocValuesQuery and often you'd get worse performance when using them. 
   
   nowadays, we can expose them more cleanly, via e.g. `KeywordField.newSetQuery` 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416895938

   Thanks @rmuir! I'll have a look at the PRs tomorrow. I've got a bad case of jet lag kicking in at the moment and my brain is turning to mush. In the meantime, I took a pass at simplifying the proposal here in this PR. There was a lot of complexity (as you pointed out) for edge-cases and such. I tried to keep it focused on just the core idea of seeking some terms to get a better cost estimate, in a way that adds minimal complexity. Benchmark results are about the same as before, so quite a bit simpler and no real performance difference: [TiSBenchResults_Simplified.md.txt](https://github.com/apache/lucene/files/10610038/TiSBenchResults_Simplified.md.txt)
   
   Of course, this still relies on optimizing internally instead of leveraging `IndexOrDocValuesQuery`, so I recognize it's not getting at the core of your concern. I figured it would be good to at least simplify a bit though so we have something a little cleaner/simpler to discuss.
   
   Thanks again for the feedback and related PRs!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072841550


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   When I think of worst-case, I'm assuming where these term dictionaries don't even fit in RAM. We should really always assume this stuff doesn't fit in RAM :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072872477


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   That's a good point/perspective. I'm convinced. It's easy enough to borrow those ideas from our existing `TermInSetQuery`, so I'll do that. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072830614


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   but this is really bad for performance to be unsorted: it means we do a bunch of random access lookups in the terms dictionaries: looping over these unsorted terms and doing seekExact, looping over these unsorted terms and doing lookupOrd that could instead easily be sequential/more friendly.
   
   Given that sometimes looking up all the terms is a very heavy cost for this thing, calling `Arrays.sort()` in the ctor seems like an easy-win/no-brainer. As is the prefix-coding to keep the RAM usage lower in the worst-case. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419344879

   > but there is also one benchmark where it does substantially worse ("Low Cardinality + High Cost Country Code Filter Terms").
   
   100%. The issue here is that `TermInSetQuery` gets rewritten to a `BooleanQuery` because there are fewer than 16 terms, so it doesn't have a chance to "self-optimize" to use doc values. We can fix this by not eagerly rewriting to a `BooleanQuery`, but I held off doing that for now. So this is "easily" fixable I think.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1386068784

   @rmuir 
   > I was naively thinking to try to the same approach with the DocValuesTermsQuery that is in sandbox...
   
   I think that's probably a good place to start honestly. I was thinking of introducing this in the sandbox module initially and not actually hooking it into `KeywordField`. Then maybe following up by graduating it and hooking it in? But maybe it's worth doing initially if we can get it right? I dunno.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416779603

   Hmm, @gsmiller I really personally don't like all this complexity and optimizations mixed into `TermInSetQuery`. It becomes almost unmanageable.
   
   This is why I like the IndexOrDocValues approach: we can have two simple queries instead of one extremely complicated one.
   
   If it is a must that we create a giant HeroicTermInSetQuery that tries to do anything and everything, then let's name it just that: also make it package private in `org.apache.lucene.document`, it can be hooked in by KeywordField.
   
   But I'd really like us to consider the cost in the complexity here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416866945

   @rmuir thanks for the feedback. +1 to making sure we're getting the right balance of complexity vs. benefit. If you strip away some of the complexity, I think the main benefit of the approach I'm proposing here is the ability to term-seek in a somewhat sensible way to come up with a better cost estimate. The challenge with the existing `TermInSetQuery` is that it's only looking at field level stats to guess at the disjunction cost. This can lead to a situation where the cost is significantly overestimated, causing `IndexOrDocValues` to use doc values when postings would be more appropriate. The advantage of optimizing within the query implementation is that it can short-circuit the term seeking when it reaches a point where it believes doc values will be more appropriate. I'm not sure how we'd do this in a sensible way in the existing `ScoreSupplier#cost` abstraction.
   
   So this shows up with something like `Medium Cardinality + Low Cost Country Code Filter Terms` in the benchmarks, as well as the PK cases. The challenge with the PK case is a bit unique. While `ScoreSupplier#cost` in `TermInSetQuery` already handles PK cost estimate in a sensible way, it's missing the fact that _most_ terms will not be present in a given segment (assuming the index is segmented). Again, this is where term-seeking helps provide a tighter cost estimate. Both of these are things we may be able to tweak in the current `TermInSet` cost approximation so that we don't need a special query implementation.
   
   As for where the proposed implementation is worse than `IndexOrDocValues`, I think the main culprit is `Low Cardinality + High Cost Country Code Filter Terms`. This is because I kept the existing `rewrite` logic to rewrite to a boolean query if there are fewer than 16 terms. So it will _always_ use a postings approach with < 16 terms, even though that's not the right thing in this situation. This is something I was planning to look into more. When I removed this rewrite logic, it performed much better in this case, but I bumped into some unit tests elsewhere that assume the rewrite, and I thought it would help to revise the rewrite logic in a separate pass rather than try to solve for that now.
   
   Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on `IndexOrDocValues` would be preferable if we can solve for cost over-estimating. I'd pursued this path after some feedback from @jpountz (https://github.com/apache/lucene/pull/11741#issuecomment-1241681411) that it may make sense to take this type of approach, but I'm all for keeping it simple if we can here. I'll keep an eye out for your draft PR. Thanks for the engagement on this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1419214500

   That's good that it made progress. I will look more into it tonight. I want to get these patches landed to simplify benchmarking.
   
   its true there is one benchmark where this combined query does better ("Medium Cardinality + Low Cost Country Code Filter Terms") but there is also one benchmark where it does substantially worse ("Low Cardinality + High Cost Country Code Filter Terms").
   
   so net/net i would say they are equivalent. But I will look into this case to see if we can still do better.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416767759

   @rmuir I modified the PR to update the existing `TermInSetQuery` in-place, instead of introducing a new sandbox concept. I've added some test coverage and re-ran all the benchmarks. The updated benchmark code is here: [TiSBench.java.txt](https://github.com/apache/lucene/files/10608857/TiSBench.java.txt). The results are here: 
   [TiSBenchResults.md.txt](https://github.com/apache/lucene/files/10608858/TiSBenchResults.md.txt).
   
   I think this is ready for a more thorough review now if you (or anyone else) has the time. My plan is to reference this query directly from something like `KeywordField#newSetQuery` after #12054 is merged (or could be part of that PR). Based on the benchmarks, I wouldn't expect this change to cause any significant regressions to existing `TermInSetQuery` users, and it looks more promising than trying to leverage `IndexOrDocValues`. I'm sure there's room to tweak/improve it further, but my hope is to get something initial merged and then iterate as it makes sense. Thanks for your feedback so far!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller closed pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available
URL: https://github.com/apache/lucene/pull/12089


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416839513

   I think some of the benchmarks needs to be revisited here anyway. `DocValuesTermsQuery` is not in great shape, which may lead to bad results from IndexOrDocValuesQuery and cause confusion. I will clean it up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385954675

   Thanks for looking at this. I can alter benchmark from #12087 to test this case, honestly we could even just take the benchmark and index the numeric field as a string instead as a start :)
   
   In the case of numeric fields, we just had crazy query in the sandbox which is better as e.g. NumericDocValues.newSlowSetQuery. And then we hooked into IntField etc as newSlowSetQuery with the IndexOrDocValuesQuery. For that one, we had to fix PointInSetQuery to support ScorerSupplier etc (but TermInSetQuery already has this cost estimation)
   
   I was naively thinking to try to the same approach with the DocValuesTermsQuery that is in sandbox... though I anticipated maybe more trickiness with inverted index as opposed to points. But maybe IndexOrDocValuesQuery would surprise me again, of course its probably worth exploring anyway, we could compare the approaches. I do like IndexOrDocValuesQuery for solving these problems and if we can improve it to keep it generic, i'd definitely be in favor of that. But whatever is fastest wins :)
   
   I do think its important to add these fields such as KeywordField and put this best-practice logic behind simple methods so that it is easier on the user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1385982331

   I modified the benchmark from #12087 to just use StringField instead of IntField. The queries are supposed to be "hard" in that I'm not trying to benchmark what is necessarily typical, instead target "hard" stuff that is more worst-case (e.g. we shouldn't cause regressions vs `new PointInSetQuery()` in main branch today): [StringSetBenchmark.java.txt](https://github.com/apache/lucene/files/10438949/StringSetBenchmark.java.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416786686

   another way to say it: I see a lot more optimizations and conditionals in the query than I see wins in the benchmark?
   
   which of these truly matter? Can we remove some to simplify the new query?
   Is there a key optimization or two that can simply be folded into existing TermInSetQuery/DocValuesTermsQuery cost() method so we can improve IndexOrDocValuesQuery?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416891188

   > Anyway, back to the point about complexity vs. benefit, I 100% agree that relying on `IndexOrDocValues` would be preferable if we can solve for cost over-estimating. I'd pursued this path after some feedback from @jpountz ([#11741 (comment)](https://github.com/apache/lucene/pull/11741#issuecomment-1241681411)) that it may make sense to take this type of approach, but I'm all for keeping it simple if we can here. I'll keep an eye out for your draft PR. Thanks for the engagement on this!
   
   Yeah overall my concern is not so much with your specific query, just that it doesn't scale to many other queries. It is only matter of time before someone says "hey, we can really speed up many other slow use-cases by adding KeywordField.newPrefixQuery, newWildCardQuery, etc by using the docvalues", and it's true (for selective queries, you could avoid intersection against any large terms dict at all and do a a couple per-doc very-high-cost lookupTerm + runautomaton match). So it's an example where i'd like to avoid a mess: it would be better to fix apis such as MultiTermQuery, IndexOrDocValuesQuery, ScorerSupplier, etc so that we can make things work cleanly with simpler queries that are easier to test correctness of and maintain.
   
   But practically, right now benchmark is not very fair comparison because some of these queries just have obvious straightforward performance deficiencies, DocValuesTermsQuery especially.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1097623311


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -258,13 +271,41 @@ public Matches matches(LeafReaderContext context, int doc) throws IOException {
        * On the given leaf context, try to either rewrite to a disjunction if there are few matching
        * terms, or build a bitset containing matching docs.
        */
-      private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
+      private WeightOrDocIdSet rewrite(
+          LeafReaderContext context, long leadCost, boolean isPrimaryKeyField, DocValuesType dvType)
+          throws IOException {
         final LeafReader reader = context.reader();
 
         Terms terms = reader.terms(field);
         if (terms == null) {
           return null;
         }
+
+        long costThreshold = Long.MAX_VALUE;
+        if (dvType == DocValuesType.SORTED || dvType == DocValuesType.SORTED_SET) {
+          // Establish a threshold for switching to doc values. Give postings a significant
+          // advantage for the primary-key case, since many of the primary-key terms may not
+          // actually be in this segment. The 8x factor is arbitrary, based on IndexOrDVQuery,
+          // but has performed well in benchmarks:
+          costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
+
+          if (termData.size() > costThreshold) {
+            // If the number of terms is > the number of candidates, DV should perform better.

Review Comment:
   I'm not sure actually. The up-front term-seeking you refer to is certainly a cost, but it doesn't scale with the number of lead hits. So this can still be cheaper. But also, +1 to the idea of trying out on-demand term seeking for these situations!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407307849

   > attach it to a github comment
   
   That works! Here's how I benchmarked. One note if you're interested in running this is to make sure to shuffle the genomes data prior to running or you get very skewed results. This is because the file is sorted by country code, so a postings-based approach is heavily favored by the natural index sorting if you index the lines in this order.
   [TiSBench.txt](https://github.com/apache/lucene/files/10526083/TiSBench.txt)
   
   
   I'm actually a bit surprised/impressed that our existing `IndexOrDocValues` functionality works as well as it does across these queries, given how rough of an estimate `#cost()` is on the current `TermInSetQuery`. There are some clear cases where term-seeking and using term-level stats in a heuristic helps make better decisions, but I expected the difference to be more pronounced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1418000532

   Thanks, the patch is super-helpful as it illustrates the real issue. I am looking into it, to me the issue is a problem with `TermInSetQuery` ScorerSupplier cost method. Points uses a different approach that works great in all my benchmarks, so it seems like terms/postings might need one too.
   
   I didn't get distracted, just saw related issues on the DV side that are easily fixed: and i've still got a few TODOs for these queries remaining. It is one advantage of having the separate queries: makes it super easy to just benchmark the pure-DV case and find/fix issues.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416790181

   i'll work up a draft PR trying to work it from the opposite direction (keep queries split apart) so my suggestions will be more concrete. You did the hard part already and made a benchmark, I will reuse it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1421121653

   @rmuir thanks for the feedback and spending time having a look. I'm going to try summarizing where we've landed to make sure we're on the same page. I think we both agree on the following, but please let me know if I'm missing anything?
   1. Ideally we want to leverage `IndexOrDocValues` instead of putting custom solutions into queries themselves (e.g., we'd prefer not to have query implementations decide between postings / doc values). This decoupling allows more reuse and separate innovations in the separate queries (e.g., we can independently optimize `TermInSetQuery` and `DocValuesTermsQuery`). This also provides more reuse across various use-cases, as opposed to lots of different query implementations having to re-invent this postings vs. doc values logic.
   2. The only inputs `IndexOrDocValues` has for for making a decision are, 1) "lead cost" and 2) the stated "cost" of the wrapper query `ScoreSupplier`s. `TermInSetQuery` may significantly over-estimate cost in its currently implementation as it guarantees a cost ceiling, only looking at field-level statistics.
   3. "Primary key" type cases work well with `IndexOrDocValues` since `TermInSetQuery` cost is able to recognize this situation and provide a better cost estimate.
   4. The case where `IndexOrDocValues` doesn't do a great job is when the indexed terms cover many documents in general, but the terms used in the specific query cover few.
   
   Given this, I tend to agree that moving away from `IndexOrDocValues` to only solve one type of scenario (mentioned in `#4` above), probably isn't the right decision. To move forward, I'm going to explore other ways to solve this case while relying on the `IndexOrDocValues` abstraction. I can think of a couple ways to approach the problem:
   1. Alter the way we compute `TermInSetQuery#cost`. Instead of providing a ceiling, we could look at some sort of "average" cost (e.g., determine average number of docs per term and multiply that out by the number of terms in the query). We could also term-seek up to some fixed number of terms up-front to get more information. This obviously ads cost but may be worth it
   2. While significantly more complex, if it's not sufficient to use field-level statistics and a single cost value, I may further explore the idea of a "cost iterator" in `ScoreSupplier`. This still feels too complex to me and I don't really want to go in this direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
rmuir commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072855208


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   you can use this tool when benchmarking to help make sure index no longer fits in RAM: https://github.com/mikemccand/luceneutil/blob/b48e7f49b19c27367737436214cc1ce7e67ad32c/src/python/ramhog.c
   
   or you can open up the computer and remove DIMMs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416888408

   I fired up 3 related PRs to at least make it more of a fair fight :)
   
   #12127
   #12128
   #12129
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420797458

   I spent mine time her just to prevent that code is duplicated and prevent a mess, nothing more. I got no skin in the game and could care less about these stupid abusive "joins" that people do.
   
   But let's please look at the current situation and say "we are gonna keep it clean" so I don't have to stay up so late anymore to prevent code mess. There is no advantage to piling all this shit in one query. breathe in, breathe out, let it go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072871141


##########
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##########
@@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {
       matchCost = 0f;
     }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {

Review Comment:
   This change is common to #12055, so I'm hoping we'd actually land it as part of that work and not needed just for this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072874867


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;
+  private final int termsHashCode;
+
+  public TermInSetQuery(String field, Collection<BytesRef> terms) {
+    this.field = field;
+
+    final Set<BytesRef> uniqueTerms;
+    if (terms instanceof Set<BytesRef>) {
+      uniqueTerms = (Set<BytesRef>) terms;
+    } else {
+      uniqueTerms = new HashSet<>(terms);
+    }
+    this.terms = new BytesRef[uniqueTerms.size()];
+    Iterator<BytesRef> it = uniqueTerms.iterator();
+    for (int i = 0; i < uniqueTerms.size(); i++) {
+      assert it.hasNext();
+      this.terms[i] = it.next();
+    }
+    // TODO: compute lazily?
+    termsHashCode = Arrays.hashCode(this.terms);
+  }
+
+  @Override
+  public Weight createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)
+      throws IOException {
+
+    return new ConstantScoreWeight(this, boost) {
+
+      @Override
+      public Scorer scorer(LeafReaderContext context) throws IOException {
+        ScorerSupplier supplier = scorerSupplier(context);
+        if (supplier == null) {
+          return null;
+        } else {
+          return supplier.get(Long.MAX_VALUE);
+        }
+      }
+
+      @Override
+      public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
+        if (terms.length <= 1) {
+          throw new IllegalStateException("Must call IndexSearcher#rewrite");
+        }
+
+        // If the field doesn't exist in the segment, return null:
+        LeafReader reader = context.reader();
+        FieldInfo fi = reader.getFieldInfos().fieldInfo(field);
+        if (fi == null) {
+          return null;
+        }
+
+        return new ScorerSupplier() {
+
+          @Override
+          public Scorer get(long leadCost) throws IOException {
+            assert terms.length > 1;
+
+            // If there are no doc values indexed, we have to use a postings-based approach:
+            DocValuesType dvType = fi.getDocValuesType();
+            if (dvType != DocValuesType.SORTED && dvType != DocValuesType.SORTED_SET) {
+              return postingsScorer(reader);
+            }
+
+            if (terms.length > J * leadCost) {
+              // If the number of terms is > the number of candidates, a DV should perform better.
+              // Note that we don't know the actual number of terms here since terms may not be
+              // found in the segment:
+              return docValuesScorer(reader);
+            } else {
+              Terms t = reader.terms(field);
+              if (t == null) {
+                // If there are no postings, we have to use doc values:
+                return docValuesScorer(reader);
+              }
+
+              // Assuming documents are randomly distributed (note: they may not be), as soon as
+              // a given term's postings list size is >= the number of candidate documents, we
+              // would expect a that term's postings to require advancing for every candidate doc
+              // (in the average case). This cost is multiplied out by the number of terms, so we
+              // can estimate the total amount of posting list advances we'll need to do by
+              // taking the sum of the postings sizes across all of our terms. Unfortunately, it's
+              // costly to get the actual postings sizes (we need to seek to each term), so we
+              // do a gross estimate up-front by computing the average postings size. Further
+              // complicating matters, we don't know how many of the terms are actually in the
+              // segment (we assume they all are), and of course, our candidate size is also an
+              // estimate:
+              double avgTermAdvances = Math.min(leadCost, (double) t.getSumDocFreq() / t.size());

Review Comment:
   This may be a very poor estimation depending on the use-case. If the field follows a Zipf distribution, this is making some pretty poor assumptions (that all terms have approximately equal lengths). I think this bit will be challenging to do well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1406839169

   I found some time to come back to this and did some more benchmarking. I added a markdown file with some benchmark results in this PR for now just as a place to put it. It's [here](https://github.com/gsmiller/lucene/blob/explore/sandbox-tis-query-no-multi-dv-support/lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TiSBenchResults.md).
   
   This is all using a benchmark tool I wrote that's heavily based on something @rmuir did over in #12087. I'll figure how a place to upload that benchmark tool shortly for visibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1463886798

   Closing this out for now. I think my last comment provides a pretty good summary of where things landed here. I did experiment a bit with other ways of estimating cost within `TermInSetQuery`'s score supplier, but couldn't come up with anything useful. The core problem remains that using field-level statistics just isn't granular enough to make the "right decision" in all cases between postings or doc values, so unless we do some term seeking, I don't see a way to improve it. Since there was significant push-back on doing this, I'll close out for now. Thanks again everyone!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: Explore TermInSet Query that "self optimizes"

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072871141


##########
lucene/core/src/java/org/apache/lucene/search/DisiWrapper.java:
##########
@@ -57,4 +57,14 @@ public DisiWrapper(Scorer scorer) {
       matchCost = 0f;
     }
   }
+
+  public DisiWrapper(DocIdSetIterator iterator) {

Review Comment:
   This change is common to #12055



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1420793280

   > 100%. The issue here is that `TermInSetQuery` gets rewritten to a `BooleanQuery` because there are fewer than 16 terms, so it doesn't have a chance to "self-optimize" to use doc values. We can fix this by not eagerly rewriting to a `BooleanQuery`, but I held off doing that for now. So this is "easily" fixable I think.
   
   Well right now, I'm not seeing any justification to mix up concerns and pile all into one query. Currently overall, IndexOrDocValues has the best performance: it is slightly slower in one case but massively faster in that case. 
   
   Also there are some problems with the benchmark, i tried to wrestle with it but there is some serious noise/gc/something going on. if i just rearrange order of tests numbers  change dramatically. 
   
   I am still willing to optimize the one case where IndexOrDocValues might do better, but I honestly don't feel I need to "play defense" any more. I think we should use IndexOrDocValues.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by GitBox <gi...@apache.org>.
gsmiller commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1072835306


##########
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/TermInSetQuery.java:
##########
@@ -0,0 +1,527 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.sandbox.queries;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Objects;
+import java.util.Set;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.FieldInfo;
+import org.apache.lucene.index.LeafReader;
+import org.apache.lucene.index.LeafReaderContext;
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.index.SortedDocValues;
+import org.apache.lucene.index.SortedSetDocValues;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.index.TermState;
+import org.apache.lucene.index.Terms;
+import org.apache.lucene.index.TermsEnum;
+import org.apache.lucene.search.ConstantScoreScorer;
+import org.apache.lucene.search.ConstantScoreWeight;
+import org.apache.lucene.search.DisiPriorityQueue;
+import org.apache.lucene.search.DisiWrapper;
+import org.apache.lucene.search.DocIdSetIterator;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchNoDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.QueryVisitor;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Scorer;
+import org.apache.lucene.search.ScorerSupplier;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TwoPhaseIterator;
+import org.apache.lucene.search.Weight;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.DocIdSetBuilder;
+import org.apache.lucene.util.LongBitSet;
+import org.apache.lucene.util.PriorityQueue;
+
+public class TermInSetQuery extends Query {
+  // TODO: tunable coefficients. need to actually tune them (or maybe these are too complex and not
+  // useful)
+  private static final double J = 1.0;
+  private static final double K = 1.0;
+  // L: postings lists under this threshold will always be "pre-processed" into a bitset
+  private static final int L = 512;
+  // M: max number of clauses we'll manage/check during scoring (these remain "unprocessed")
+  private static final int M = Math.min(IndexSearcher.getMaxClauseCount(), 64);
+
+  private final String field;
+  // TODO: Not particularly memory-efficient; could use prefix-coding here but sorting isn't free
+  private final BytesRef[] terms;

Review Comment:
   I generally agree. I was testing some interesting use-cases though with our current `TermInSetQuery` where we have 10,000+ terms (PK-type field), and the sorting it does actually came with a significant cost (and we got a pretty good win by removing it). But that's a bit of a different use-case maybe now that I think about it. We have a bloom filter in place as well, and a good share of the terms aren't actually in the index. So there's that...
   
   OK, yeah, we probably ought to sort here in a general implementation :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1416780926

   and when i look at the benchmarks, there are cases where IndexOrDV does better, and there are cases where this one does worse.  I'm not seeing a "clear winner", so i'd personally lean towards tie-breaking on "lower complexity". 
   
   We might be able to improve performance in easier ways that don't have complexity cost, such as replacing gigantic bitset with longhashset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on a diff in pull request #12089: Modify TermInSetQuery to "self optimize" if doc values are available

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on code in PR #12089:
URL: https://github.com/apache/lucene/pull/12089#discussion_r1097579916


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -380,21 +431,28 @@ public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOExcepti
         // cost estimates.
         final long cost;
         final long queryTermsCount = termData.size();
-        long potentialExtraCost = indexTerms.getSumDocFreq();
+        final long sumDocFreq = indexTerms.getSumDocFreq();
+        long potentialExtraCost = sumDocFreq;
         final long indexedTermCount = indexTerms.size();
         if (indexedTermCount != -1) {
           potentialExtraCost -= indexedTermCount;
         }
         cost = queryTermsCount + potentialExtraCost;
 
+        final boolean isPrimaryKeyField = indexedTermCount != -1 && sumDocFreq == indexedTermCount;

Review Comment:
   Since `terms.size()` is an optional index statistic, maybe check `sumDocFreq == docCount` instead?



##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -258,13 +271,41 @@ public Matches matches(LeafReaderContext context, int doc) throws IOException {
        * On the given leaf context, try to either rewrite to a disjunction if there are few matching
        * terms, or build a bitset containing matching docs.
        */
-      private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException {
+      private WeightOrDocIdSet rewrite(
+          LeafReaderContext context, long leadCost, boolean isPrimaryKeyField, DocValuesType dvType)
+          throws IOException {
         final LeafReader reader = context.reader();
 
         Terms terms = reader.terms(field);
         if (terms == null) {
           return null;
         }
+
+        long costThreshold = Long.MAX_VALUE;
+        if (dvType == DocValuesType.SORTED || dvType == DocValuesType.SORTED_SET) {
+          // Establish a threshold for switching to doc values. Give postings a significant
+          // advantage for the primary-key case, since many of the primary-key terms may not
+          // actually be in this segment. The 8x factor is arbitrary, based on IndexOrDVQuery,
+          // but has performed well in benchmarks:
+          costThreshold = isPrimaryKeyField ? leadCost << 3 : leadCost;
+
+          if (termData.size() > costThreshold) {
+            // If the number of terms is > the number of candidates, DV should perform better.

Review Comment:
   I wonder if this is right given that the doc-values query still eagerly evaluates all terms against the terms dictionary. For this to work correctly, we'd need a query that looks up terms lazily rather than eagerly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407400693

   I wonder for both these DV-queries (existing one in sandbox and new one here in sandbox), if they should really be using this `LongBitSet` to mark the matching ordinals found in the terms dict... it seems overkill/wasteful to create a bitset sized over the entire terms dictionary. 
   
   #12087 is using a `org.apache.lucene.document.LongHashSet` for this purpose.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1407020062

   thanks for the work benchmarking! you can rename to a .txt file and just attach it to a github comment, as one solution.
   
   yeah, this one looked to be trickier at a glance. simply indexing the same integer values as strings makes the queries quite a bit slower than what I saw with Points+DV on #12087. And not even using lots of query terms...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12089: [DRAFT] Explore TermInSet Query that "self optimizes"

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12089:
URL: https://github.com/apache/lucene/pull/12089#issuecomment-1408607273

   > I wonder for both these DV-queries (existing one in sandbox and new one here in sandbox), if they should really be using this LongBitSet to mark the matching ordinals found in the terms dict... it seems overkill/wasteful to create a bitset sized over the entire terms dictionary.
   
   Interesting. I'll experiment with that in this benchmark to see if it makes any real difference (+1 to being more memory efficient here if possible).
   
   > also personally i wouldn't target the sandbox at all when attacking this problem
   
   That's fair. I'll target "non sandbox."
   
   
   I'll work on tightening up the code a bit and adding testing soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org