You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "gsmiller (via GitHub)" <gi...@apache.org> on 2023/02/08 14:48:59 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

gsmiller opened a new pull request, #12135:
URL: https://github.com/apache/lucene/pull/12135

   ### Description
   
   Term sorting and prefix-coding can be a significant amount of work with a large number of terms. This change allows `KeywordField#newSetQuery` to do that work once, instead of `TermInSetQuery` and `SortedSetDocValuesSetQuery` duplicating the effort (and memory footprint as well).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1100315138


##########
lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java:
##########
@@ -39,6 +43,34 @@ public class PrefixCodedTerms implements Accountable {
   private long delGen;
   private int lazyHash;
 
+  /** Create {@link PrefixCodedTerms} from a single field and array of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, BytesRef... terms) {
+    return ofFieldTerms(field, Arrays.asList(terms));
+  }
+
+  /** Create a {@link PrefixCodedTerms} for a single field and collection of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, Collection<BytesRef> terms) {

Review Comment:
   Let's not add this Collection stuff: we can just support arrays.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424699768

   Let me try to clarify a couple things:
   
   First, this is not an Elasticsearch use-case. Probably not all the important, but this is for an Amazon Product Search case, where we use Lucene directly. We're not applying access right, etc. We have use-cases where we need to evaluate allow-lists of products.
   
   Yes, we can fork all these queries. But this looks like an opportunity to avoid cloning and sorting an array multiple times. If both of these underlying queries need sorted terms, it seems like it would be nice to avoid it.
   
   Yes, a public constructor that requires sorted data should check it. The current proposal is doing this by leveraging the class type of the collection.
   
   I also agree with not exposing the prefix-encoded implementation detail in the constructor, which I've also addressed in the latest revision.
   
   It seems like the pushback here amounts to, 1) not wanting to mess with collections instead of arrays, and 2) generally feeling like this is an unnecessary complication for a non-general use-case. I personally disagree on both points. Sure, arrays are better in a number of ways, but this seems like a reasonable use of collections to "mark" sortedness as part of the type contract. While I also recognize that large cardinality term cases maybe aren't the "norm," we could extend that argument to say we shouldn't both with `TermInSetQuery` at all. BooleanQuery is just fine with few terms. I could create a slippery slope argument pretty quickly that says, "no more term in set at all!" I don't like what's at the bottom of that slope.
   
   So I disagree with the pushback (at least in the current revision; I fully understand the pushback around not exposing prefix encoding details). But I don't disagree so strongly that I'll keep fighting for this if folks are unhappy with it. I'll do one more pass to see if I can leverage `TimSort`'s efficient implementation on pre-sorted data. If that pans out, maybe it's enough to just sort the terms in `KeywordField` before passing them along. The underlying queries can still do their own sorting, which will just be a no-op. We'll still do some unnecessary array cloning, but maybe it doesn't matter.
   
   Anyway, I appreciate the feedback and the discussion!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Avoid duplicate sorting in KeywordField#newSetQuery [lucene]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1880903859

   This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425837156

   For now I kept the TreeSet in KeywordField, but we can think about that separately. I don't like it too much. Maybe a lazy stream would be best to prevent double sorting. See discussion previously about the downside of IndexOrDocvaluesQuery.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424473544

   > But it brings java Collections into the picture, which seems not great versus using simple arrays.
   
   I don't love that either. An alternative could be to just require the ctor to be invoked with a pre-sorted array and not mess about with array copying and sorting. It would of course need to be a non-back compatible change released in 10, but that's OK.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425765853

   PS: one possibility to remove the 'separate queries' would be to define this in terms of multitermquery and instead move implementation stuff to RewriteMethod. I suggested it a long time ago, because it could prevent issues like this.
   
   And this query is the most basic example of "Multi Term Query" I can imagine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425779002

   > I see this as a feature, not a bug. As you can see, left unchecked, "database functionality" can be quite invasive on the codebase.
   
   The separate query is of course fine, the problem is only that child nodes of the IndexOrDocavlauesQuery may have an expensive ctor (like the one popuplating PrefixCodedTerms).
   
   Basically if all would be in MTQ rewrite this could be handled. If we keep IndexOrDocvaluesQuery we would need to make the soring and PrefixCodedTerms lazy. A good idea would be to make the ctor arguments of IndexOrDocvaluesQuery lazy (producers or functions) and the actual rewrite will create the subqueries. But this would break equals and hashcode, as IndexOrDocavluesQuery relies on the hashcode and equals of the childs.
   
   Because of this MTQ would be much better. The MTQ instance knows everything for equals/hashcode and the rewritten form is lazily created.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425786387

   If it was done as MTQ, the points+dv case would still need some integration (e.g. IndexOrDocValuesQuery(PointInSetQuery, MTQ(using docvalues)...). So I imagine that case would not get any better, which is sad since if you want to do this stuff, using points is really the way to go: you must make your IDs fixed width or numeric, but they execute much faster than the terms versions. no ordinal lookups in each segment, better cost estimation, etc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425749651

   Actually my idea was to keep PrefixCodedTerms changes out of the game. It stays internal class an it will not be modified at all.
   
   We just need a `java.util.stream.Collector` (or similar, I'd prefer that as it is a [mutable reduction](https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#Reduction)) that produces a PrefixCodedTerms instance from a stream of BytesRef. Then you can feed in any bytesref container (array, collection with or without sorting and deduplication).
   
   Let me hack this together later.
   
   Another problem I see is the following: The current code uses IndexOrDocvaluesQuery. This creates both queries and there is the stupid part: We sort the terms and popuplate PrefixCodedTerms, although it might no even use the query at execution time at all. This is in my opinion the main issue, not the sorting alone.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807

   To avoid leaking internal implementation details of the two query implementations, I've taken a different approach that doesn't expose the fact that we're internally representing terms as prefix-coded data.
   
   This new approach just avoids duplicate sorting/cloning, but the prefix encoding is still done twice. I did some informal "benchmarking" of the two approaches with a simple unit test (see below), and the sorting/cloning of values is the most impactful bit of work to save. The duplicate prefix encoding didn't really register in the benchmarks (i.e., this approach and the previous approach were essentially the same).
   
   To make it a "fair fight" with the current version on `main`, I fixed an issue in the current implementation where the values are not cloned before being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on `main` is sorting the provided values array in place, which probably isn't the right thing to do (we correctly clone everywhere else we use `SortedSetDocValuesSetQuery`).
   
   With the cloning in place, this PR cuts the "benchmark" results roughly in half (from 134ms to 68ms on my MacBook).
   
   Here's my simple "benchmark":
   ```
     public void testSortPerformance() {
       int len = 50000;
       BytesRef[] terms = new BytesRef[len];
       for (int i = 0; i < len; i++) {
         String s = TestUtil.randomSimpleString(random(), 10, 20);
         terms[i] = new BytesRef(s);
       }
   
       int iters = 300;
       for (int i = 0; i < iters; i++) {
         KeywordField.newSetQuery("foo", terms);
       }
   
       long minTime = Long.MAX_VALUE;
       for (int i = 0; i < iters; i++) {
         long t0 = System.nanoTime();
         KeywordField.newSetQuery("foo", terms);
         minTime = Math.min(minTime, System.nanoTime() - t0);
       }
   
       System.err.println("Time: " + minTime / 1_000_000);
     }
   ```
   
   The downside of this approach is that we're duplicating the memory footprint of the prefix encoded terms. I think that's an OK trade-off for now though to avoid exposing implementation details of `TermInSetQuery`, since there's no way to reduce the visibility of that ctor (we'd have to just tag it `@lucene.internal`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1101623346


##########
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##########
@@ -168,8 +171,10 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, BytesRef... values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
+    SortedSet<BytesRef> sortedValues = new TreeSet<>(Arrays.asList(values));
     return new IndexOrDocValuesQuery(
-        new TermInSetQuery(field, values), new SortedSetDocValuesSetQuery(field, values));

Review Comment:
   If we don't end up merging this change, we should clone `values` before invoking `SortedSetDocValuesSetQuery#new`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Avoid duplicate sorting in KeywordField#newSetQuery [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller closed pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery
URL: https://github.com/apache/lucene/pull/12135


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424631344

   Hi,
   I fully agree with Robert here. I see no reason why you want to avoid a stupid no-op Arrays.sort(). Yes if the array is large, the sort takes some time, sorry. But what does this have to do with PrefixCodedTerms? Why is Elasticsearch using this private API?
   
   I would agree if we can internally have some special-case ctor that is package private (or it is shielded by using some shared-secrets approach if it needs to work cross-package), which allows to "reuse" an already existing PrefixCodedTerms instance internally when you build a TermInSetQuery from some case like building a TermInSetQuery for the joins module from a Term list. But PrefixCodedTerms should never ever be part of public APIs. If you want it "correct" use Java 8 streams API, although Adrien does not like it, but it is the "correct way" to do this kind of stuff.
   
   If Elasticsearch needs to do some internal stuff and can't live with a (cheap) resort of an array or using java.util.Stream (if it is already sorted, TimSort will just linear scan through the array), it should fork the query as this is a special case and does not need to be supported in Lucene. All APIs are public, so you could make a subclass/fork  of TermInSetQuery that relies on some Elasticsearch-internal array container clsses that have a defined sort order. It is not Lucene's task to add support in our public APIs for some misuse of Lucene (and those lists of terms for access rights is a misuse of Lucene: it is a search engine and not a US NASA-for-travel-to-mars certified FIPS-0815 foobar bullshit with high security compliant and unbreakable datastore with strict user access checks on millions of documents).
   
   Tell your customers that this is slower if you apply crappy access rights that should never have been in a serach engine!
   
   Keep in mind: In any public API we have for sure to either always sort it or alterantively if we require a presorted array, we have to actually check this. If one passes an array in a public parameter and the defintion says it has to be presorted: WE MUST CHECK THIS!. If one would pass an unsorted array we would need to throw exception, so we need to check correct order as part of param validation. But as checking that the array is in correct order is as expensive as just sorting it, we should simply sort it. There's no discussion possible!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476

   @rmuir The use case where I've seen this in the wild has to to with allow/deny lists. We have some use-cases where we only want to match documents that exist in some allow-list. That allow-list can be quite large (potentially tens of thousands), but many of the terms aren't present in a given index we're searching. We use the bloom filter codec to efficiently drop terms not present. So we have a large number of terms we need to initialize our `TermInSetQuery` with, but a much smaller number of them actually end up term-seeking, etc., so the sorting actually appears to dominate when we've profiled.
   
   I've "redacted" a bunch of this flame chart since this was on an internal system, but you can see how long we're spending sorting terms vs. everything else here:
   <img width="1488" alt="Screen Shot 2022-10-21 at 4 05 03 PM" src="https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png">
   
   Deny-listing can obviously have the same issue. I believe `TermInSetQuery` was originally created to handle deny-lists in tinder search where there can be tens of thousands of "swipe left" profiles that need to be excluded from results?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425757874

   > Another problem I see is the following: The current code uses IndexOrDocvaluesQuery. This creates both queries and there is the stupid part: We sort the terms and popuplate PrefixCodedTerms, although it might no even use the query at execution time at all. This is in my opinion the main issue, not the sorting alone.
   
   I see this as a feature, not a bug. As you can see, left unchecked, "database functionality" can be quite invasive on the codebase. There are all these possibilities for this "join" query:
   * Terms
   * DV
   * Points
   * Terms + DV
   * Points + DV
   
   I think it is the correct tradeoff to just have 3 simple queries, one for each underlying index structure, and use IndexOrDocValues to combine them. It keeps this functionality maintainable and in check.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1100378648


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -81,34 +77,21 @@ public class TermInSetQuery extends Query implements Accountable {
   private final PrefixCodedTerms termData;
   private final int termDataHashCode; // cached hashcode of termData
 
-  /** Creates a new {@link TermInSetQuery} from the given collection of terms. */
-  public TermInSetQuery(String field, Collection<BytesRef> terms) {
-    BytesRef[] sortedTerms = terms.toArray(new BytesRef[0]);
-    // already sorted if we are a SortedSet with natural order
-    boolean sorted =
-        terms instanceof SortedSet && ((SortedSet<BytesRef>) terms).comparator() == null;
-    if (!sorted) {
-      ArrayUtil.timSort(sortedTerms);
-    }
-    PrefixCodedTerms.Builder builder = new PrefixCodedTerms.Builder();
-    BytesRefBuilder previous = null;
-    for (BytesRef term : sortedTerms) {
-      if (previous == null) {
-        previous = new BytesRefBuilder();
-      } else if (previous.get().equals(term)) {
-        continue; // deduplicate
-      }
-      builder.add(field, term);
-      previous.copyBytes(term);
-    }
-    this.field = field;
-    termData = builder.finish();
-    termDataHashCode = termData.hashCode();
+  /** Creates a new {@link TermInSetQuery} from the given prefix-coded terms. */
+  public TermInSetQuery(String field, PrefixCodedTerms terms) {

Review Comment:
   This ctor needs to be marked internal for sure, because this class is public. Users should not be messing with PrefixCodedTerms at all. It is implementation detail of indexer for frozen deletes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1422938255

   Of course it would still be needed to check performance of huge sets, but I think the stream overhead is so highly optimized in recent JDKs, so theres no issue (especially for large collections and many terms it is better memory-wise to not copy everything into an array and sort it). The user could also make the stream parallel :-)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1423017019

   Just to make sure I understand the suggestion here, it sounds like the idea would be to use a `Stream<BytesRef>` in place of `PrefixCodedTerms` both in `TermInSetQuery` and `SortedSetDocValuesSetQuery` right? And these streams would be backed by the same `BytesRef[]` provided to `KeywordField#newSetQuery`? I think there are a couple challenges with that (but maybe I'm misunderstanding the suggestion completely):
   1. Wouldn't we still be sorting the data twice since we would need to provide separate streams to each query? Or I suppose we could maybe share a single stream across the queries if we knew for sure that only one would consume it?
   2. Within each query implementation, we may end up iterating over the terms multiple times (e.g., if someone did a `toString`, `visit`, create a weight, etc.). I don't have a whole lot of experience using streams, but my understanding is they provide a "single consumption" model. So I'm not sure how this would work?
   
   I suspect I'm misunderstanding the suggestion, so just trying to clarify and figure out where I've gotten confused. Thanks for the input! I like the idea, I'm just trying to figure out how to make it workable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1100407897


##########
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##########
@@ -81,34 +77,21 @@ public class TermInSetQuery extends Query implements Accountable {
   private final PrefixCodedTerms termData;
   private final int termDataHashCode; // cached hashcode of termData
 
-  /** Creates a new {@link TermInSetQuery} from the given collection of terms. */
-  public TermInSetQuery(String field, Collection<BytesRef> terms) {
-    BytesRef[] sortedTerms = terms.toArray(new BytesRef[0]);
-    // already sorted if we are a SortedSet with natural order
-    boolean sorted =
-        terms instanceof SortedSet && ((SortedSet<BytesRef>) terms).comparator() == null;
-    if (!sorted) {
-      ArrayUtil.timSort(sortedTerms);
-    }
-    PrefixCodedTerms.Builder builder = new PrefixCodedTerms.Builder();
-    BytesRefBuilder previous = null;
-    for (BytesRef term : sortedTerms) {
-      if (previous == null) {
-        previous = new BytesRefBuilder();
-      } else if (previous.get().equals(term)) {
-        continue; // deduplicate
-      }
-      builder.add(field, term);
-      previous.copyBytes(term);
-    }
-    this.field = field;
-    termData = builder.finish();
-    termDataHashCode = termData.hashCode();
+  /** Creates a new {@link TermInSetQuery} from the given prefix-coded terms. */
+  public TermInSetQuery(String field, PrefixCodedTerms terms) {

Review Comment:
   Good callout, thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1422922194

   Actually instead of taking a collection or array of BytesRef, we could make the ctor take a `(Stream<BytesRef> stream)` and consume the stream by calling: `stream().sorted().distinct().doOurStuffByForEach()`.
   
   The cool trick is that streams have the markers for "distinct" and "sorted" internally. So `sorted()` and `distinct()` do nothing if the stream is already marked using those flags. If not, the method returns a new Stream which sorts and has the correct flags. You should always have it in order "first sort, then distinct" (see some comments in JDK issues, it is no longer mandatory for optimal perf but always better to first sort then dedup).
   
   The array ctor would just call Stream.of(array). This has no defined order and dedup, so the ctor taking the stream would sort for us. If we have a sorted list, we can trick the stream API to set the sorted and deduped flag with a helper method using [spliterator](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/Spliterators.html#spliterator(java.util.Iterator,long,int)) of correct size and correct flags.
   
   At end the ctor taking stream is plain simple and requests a sorted, distinct stream and it is up to the caller to have it presorted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


Re: [PR] Avoid duplicate sorting in KeywordField#newSetQuery [lucene]

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1880959801

   I think we can also close this one, correct? I already closed my proposal: #12141 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425676474

   > I disagree with Robert who says "only arrays". I agree with you we can also allow to pass collections. But only when we do it as proposed before:
   
   Well, the `newSetQuery()` takes an array, so i don't like adding also collections also, unnecessarily. But especially i don't like complexity added with collections into `PrefixCodedTerms` which has nothing to do with arrays nor collections at all and is just a little part of the indexer becoming made more complicated by these horrible database-join-queries :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424408782

   btw, latest commit is slightly better as at least it does not expose `PrefixCodedTerms` in a public API. But it brings java Collections into the picture, which seems not great versus using simple arrays.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1422891907

   I'm not sure how i feel about exposing PrefixCodedTerms in public apis. This really seems like an over-optimization and causes clutter. Maybe we can get @uschindler thoughts and consider using streams internally and avoid messing with PrefixCodedTerms, i think its possible to implement the `KeywordField#newSetQuery` so that stream is only sorted once.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] jpountz commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "jpountz (via GitHub)" <gi...@apache.org>.
jpountz commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1423019557

   Thanks for looking into this @gsmiller I agree we should avoid computing the prefix terms twice. I like the approach of having a private ctor that takes prefix coded terms and using it from the `KeywordField#newSetQuery` method. As far as streams are concerned, Uwe's arguments are appealing but I have a (subjective) distaste of the Stream API which makes me prefer avoiding to use them in our public APIs. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1101624944


##########
lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java:
##########
@@ -39,6 +43,34 @@ public class PrefixCodedTerms implements Accountable {
   private long delGen;
   private int lazyHash;
 
+  /** Create {@link PrefixCodedTerms} from a single field and array of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, BytesRef... terms) {
+    return ofFieldTerms(field, Arrays.asList(terms));
+  }
+
+  /** Create a {@link PrefixCodedTerms} for a single field and collection of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, Collection<BytesRef> terms) {

Review Comment:
   I ended up keeping this in this alternate version as a way of "marking" data terms as already being sorted.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424397388

   I still don't understand why we are adding all this special logic to PrefixCodedTerms, to avoid a single Arrays.sort? The sorting only happens once in the query, its not like it takes milliseconds or anything to sort a damn array!
   
   Need to understand the use-case here. I suspect its a severe database-abuse case. I have to push back on these since lucene is a search engine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424845306

   Sorry I thought this is Elasticsearch, because generally I get like daiky people asking how to hide their documents from users. At end they are always complaining that their SQL IN like queries are slow. This is why I always get angry and try to explain to people that they need to nenormalize their data. Instead of having huge lists of documents a user may see, it is better to instead index a multivalued field "users that can see this document". This always causes problems with those people that want to add new users all the time and need to therefore reindex everything. My general rule I tell them is always: index the user information in your weekly reindex and for those new ones to the heavy "SQL IN" query. But for the users that have their document they can see already reibdexes you can just query for a single term (their username".
   
   So you may understand that I get angry about that at some point. License has no way to manage access rights with all shit like group access and do on. You need to always so some mixed approach. Denormalize as much as possible, and if not possible or partially use explicit IN queries. But really avoid that and use the inverted index for what it was invented.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424863546

   I disagree with Robert who says "only arrays". I agree with you we can also allow to pass collections. But only when we do it as proposed before:
   
   We can have 2 ctors, one with `Collection<BytesRef>` and one with `BytesRef[]`. But both should call a third hidden ctor taking `Stream<BytesRef>`.
   
   This internal implementation would call: `stream.sorted()` (and possibly also `.distinct()`) and just operate on the stream. If you pass in a SortedSet (e.g. TreeSet) the sorted and distinct calls will be no-ops. It will not sort the stuff again, IF (and only IF) the comparator of the Treeset/sortedset is exactly the one we use. So it is 100% type safe.
   
   I hope Adrien understand that this is the best way to avoid duplicate sorting:
   
   ```java
   public TermInSetQuery(String field, BytesRef[] terms) {
     this(field, Arrays.stream(terms);
   }
   
   public TermInSetQuery(String field, Collection<BytesRef> terms) {
     this(field, terms.stream();
   }
   
   private TermInSetQuery(String field, Stream<BytesRef> stream) {
     super(field); // and so on
     stream.sorted().distinct().forEachOrdered(term -> {
        // process the terms coming in natural order
     });
   }
   ```
   
   This would be my favorite way to implement this query. I don't think Adrien's mental problem with streams is an argument without a benchmark.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425690083

   > Yes, a public constructor that requires sorted data should check it. The current proposal is doing this by leveraging the class type of the collection.
   
   To me this is an AWFUL lot of complexity. Yes, you are just "marking" something as sorted, but java collections are just as bad to me as java streams are to adrien. Please, lets keep such stuff away from PrefixCodedTerms and the indexer.
   
   If we can't implement TermInSetQuery without mucking up the indexer here, then maybe indeed we should just tell such database-abusers to use BooleanQuery instead.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

Posted by "uschindler (via GitHub)" <gi...@apache.org>.
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425835678

   Here is my alternative approach without any crazy addition of collections to PrefixCodedTerms (I just placed the collector there as I needed a common place to be reachable from both queries without adding another utility class): PR #12141
   
   I have not checked performance. I am happy that this shitty instanceof checks are gone. Looks much cleaner to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org