You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by "shubhamvishu (via GitHub)" <gi...@apache.org> on 2023/07/09 20:26:30 UTC

[GitHub] [lucene] shubhamvishu opened a new pull request, #12427: StringsToAutomaton#build to take List as parameter instead of Collection

shubhamvishu opened a new pull request, #12427:
URL: https://github.com/apache/lucene/pull/12427

   ### Description
   
   - This addresses issue #12319 
   - `StringsToAutomaton` previously called `DaciukMihovAutomatonBuilder`(renamed in this PR) should take List as input  instead of Collection as it expects sorted input.
   
   &nbsp;
   Closes #12319
   
   <!--
   If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide.
   https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789218896

   Something weird is happening. This commit is causing the following failure:
   ```
   ./gradlew test --tests TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource -Dtests.seed=9E4091CE5D400BA3
   ```
   
   I am starting to investigate, but folks involved with this change might find the issue quicker.
   
   ```
     >     java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.Automaton org.apache.lucene.util.automaton.Automata.makeStringUnion(java.util.Collection)'
      >         at org.apache.lucene.search.uhighlight.CharArrayMatcher.fromTerms(CharArrayMatcher.java:42)
      >         at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.buildCombinedAutomaton(MemoryIndexOffsetStrategy.java:71)
      >         at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.<init>(MemoryIndexOffsetStrategy.java:53)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getOffsetStrategy(UnifiedHighlighter.java:1238)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getFieldHighlighter(UnifiedHighlighter.java:1072)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:880)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:814)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:792)
      >         at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:725)
      >         at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:607)
      >         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
      >         at java.base/java.lang.reflect.Method.invoke(Method.java:580)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
      >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
      >         at java.base/java.lang.Thread.run(Thread.java:1583)
      >
      >     java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 7 open files: {_0_Lucene90_0.pos=1, _0_Lucene90_0.doc=1, _0_Lucene90_0.tim=1, _0.nvd=1, _0.fdx=1, _0.fdt=1, _0_Lucene90_0.tip=1}
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
      >         at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.doAfter(TestUnifiedHighlighterTermIntervals.java:84)
      >         at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
      >         at java.base/java.lang.reflect.Method.invoke(Method.java:580)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
      >         at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
      >         at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
      >         at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
      >         at java.base/java.lang.Thread.run(Thread.java:1583)
      >
      >         Caused by:
      >         java.lang.RuntimeException: unclosed IndexInput: _0.nvd
      >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)
      >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsProducer.<init>(Lucene90NormsProducer.java:84)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsFormat.normsProducer(Lucene90NormsFormat.java:96)
      >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.codecs.asserting.AssertingNormsFormat.normsProducer(AssertingNormsFormat.java:46)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:180)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:222)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:539)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:601)
      >             at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter$4.getReader(IndexWriter.java:6510)
      >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:508)
      >             at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:426)
      >             at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:581)
      >             at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
      >             at java.base/java.lang.reflect.Method.invoke(Method.java:580)
      >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
      >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
      >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
      >             at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1790129043

   Nice!....I like it how we discovered a build issue bug(which was hidden for sometime now I think) and fixed it. Thanks for solving @dweiss!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1784156071

   > Below are the results(looks all good to me). Let me know what do you think? Thanks!
   
   +1 -- looks like just noise to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "benwtrent (via GitHub)" <gi...@apache.org>.

benwtrent commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789266991

   @gsmiller you are 100% correct 🤦 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1631550554

Yeah, good points/questions! I'd be curious how much overhead it would actually add to sort the input when it's already sorted?

But to take a step back for a moment, we also have automata building methods that accept a `BytesRefIterator` as well, which also must be sorted. This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?

If we want to make these functions a bit more user-friendly, we could look at changing the `assert` on line 276 of `StringsToAutomaton` to throw an explicit `IllegalArgumentException` so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There _would_ add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change? I _do_ think it's best to explicitly let the user know they passed invalid input in a case like this, so it would be a nice change if it didn't introduce a significant performance drag.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1785924607

   +1 looks good to me as well. I like that this small change, 1) makes the API a little more general, allowing users to provide any Iterable instead of Collection, and 2) adds an explicit check for expected ordering of the iterable using IAE instead of relying on asserts being enabled. Thanks for following up!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1630241517

   @gsmiller I totally agree its not helpful if the input/data `Collection` is sorted since its a unnecessary overhead to convert to a `List` here. As per `TermInSetQuery` ctor its checking if the `collection` is of `SortedSet` type and uses natural ordering. But what if the user is explicitly passing a List (or any `Collection` for that matter) which is already sorted but is not `SortedSet`?
   Could we expose another `#build` function as `build(Collection<BytesRef> input, boolean asBinary, boolean isSorted)` or for better maybe `makeStringUnion(Collection<BytesRef> utf8Strings, boolean isSorted)` (similar to the one in tests in this PR) to completely bypass the sorting/checks if incase `isSorted == true` else we could have the same check as in `TermInSetquery` ctor and sort the terms incase they are not before moving on in `Automata#makeStringUnion` as you mentioned? Let me know your thoughts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1629757874

Thanks @shubhamvishu for working on this! As I look at the PR, I'm wondering if accepting a `List` is really the proper thing to do here. If users already have a sorted `Collection` (like `SortedSet`), are we forcing them to make unnecessary copies of their data to adhere to our API?

I wonder if the ctor for `TermInSetQuery` might serve as a reasonable example of handling sortedness? `TermInSetQuery` accepts a `Collection` but tries to see if it's already sorted (and de-duped). If it is, it uses it as-is. If not, it sorts and de-dupes on behalf of the user. The automaton building logic right now only has a simple `assert` down in the internals, which could be disabled depending on the runtime environment, resulting in tricky debugging if the user didn't handle sorting their data properly.

Since `StringsToAutomaton` is an internal class, maybe one approach would be to leave it untouched, letting it expect pre-sorted data and having a simple `assert` statement as it does today to check that (for easier debugging), but add sorting logic to `Automata#makeStringUnion`? What if that method accepted a `Collection` as it does today but had similar logic as the `TermInSetQuery` ctor related to sorting?

Apologies for leading us in the wrong direction in the issue I opened. I think a `List` is a bit more semantically appropriate than a `Collection`, but I'm not sure it's best from a pragmatic standpoint? What do you think?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1673498218

   > I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.
   
   Agreed! As we discussed above one way could be to sort based on user input that if the collection is sorted or not but since that will be for only one flavour it might be confusing. I'll try to benchmark this change to see if its causing any regression or not.
   
   > I think it's so that next can throw IOException? You can't do that with Iterator<BytesRef>.
   
   Ahh I see. That makes sense. Thanks @gsmiller !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1782977506

   @gsmiller I finally got a chance to run the benchmarks for this change. Below are the results(looks all good to me). Let me know what do you think? Thanks!
   
   ```
                     TaskQPS baseline  StdDevQPS my_modified_version      StdDev      Pct diff             p-value
                         AndHighMed       96.52      (6.5%)       95.49      (6.3%)   -1.1% ( -12% -   12%) 0.597
                        LowSpanNear      133.61      (4.7%)      132.45      (4.0%)   -0.9% (  -9% -    8%) 0.528
                          OrHighMed       67.95      (3.2%)       67.37      (4.6%)   -0.9% (  -8% -    7%) 0.492
                        MedSpanNear        5.79      (5.2%)        5.74      (4.2%)   -0.8% (  -9% -    8%) 0.573
                       OrNotHighLow      322.30      (2.3%)      319.70      (2.0%)   -0.8% (  -4% -    3%) 0.233
               BrowseDateSSDVFacets        0.90      (8.2%)        0.89      (9.4%)   -0.8% ( -17% -   18%) 0.773
                         AndHighLow      355.80      (4.2%)      352.99      (4.1%)   -0.8% (  -8% -    7%) 0.550
               MedTermDayTaxoFacets        5.85      (6.2%)        5.81      (5.7%)   -0.7% ( -11% -   11%) 0.713
                       OrNotHighMed      222.70      (3.6%)      221.51      (4.0%)   -0.5% (  -7% -    7%) 0.657
                       OrHighNotMed      194.41      (3.6%)      193.40      (3.4%)   -0.5% (  -7% -    6%) 0.641
                          MedPhrase       44.71      (3.1%)       44.50      (3.4%)   -0.5% (  -6% -    6%) 0.639
                      OrNotHighHigh      179.60      (3.7%)      178.85      (3.8%)   -0.4% (  -7% -    7%) 0.726
                       HighSpanNear        8.33      (4.6%)        8.30      (4.1%)   -0.4% (  -8% -    8%) 0.765
                       OrHighNotLow      331.53      (3.8%)      330.15      (3.7%)   -0.4% (  -7% -    7%) 0.728
                      OrHighNotHigh      248.18      (3.5%)      247.20      (3.4%)   -0.4% (  -7% -    6%) 0.718
                MedIntervalsOrdered        4.66      (4.6%)        4.64      (4.3%)   -0.4% (  -8% -    8%) 0.780
             OrHighMedDayTaxoFacets        4.39      (3.9%)        4.38      (4.2%)   -0.3% (  -8% -    8%) 0.820
                          OrHighLow      569.67      (2.6%)      568.04      (2.6%)   -0.3% (  -5% -    5%) 0.733
                LowIntervalsOrdered        2.64      (4.6%)        2.64      (4.6%)   -0.2% (  -8% -    9%) 0.899
               HighTermTitleBDVSort        5.20      (4.0%)        5.19      (4.1%)   -0.2% (  -7% -    8%) 0.895
                         OrHighHigh       35.47      (4.4%)       35.42      (5.8%)   -0.1% (  -9% -   10%) 0.929
          BrowseDayOfYearTaxoFacets        3.80     (15.9%)        3.80     (16.0%)   -0.1% ( -27% -   37%) 0.984
               BrowseDateTaxoFacets        3.78     (15.5%)        3.78     (15.5%)   -0.0% ( -26% -   36%) 0.995
                        AndHighHigh       16.35      (3.6%)       16.35      (7.5%)    0.0% ( -10% -   11%) 0.996
            AndHighMedDayTaxoFacets       49.40      (1.6%)       49.41      (1.7%)    0.0% (  -3% -    3%) 0.975
                          LowPhrase      254.05      (3.6%)      254.16      (4.0%)    0.0% (  -7% -    7%) 0.972
                         HighPhrase       40.53      (3.1%)       40.56      (3.6%)    0.1% (  -6% -    6%) 0.939
                           PKLookup      131.32      (2.3%)      131.44      (2.0%)    0.1% (  -4% -    4%) 0.891
        BrowseRandomLabelTaxoFacets        3.28     (15.0%)        3.28     (14.8%)    0.1% ( -25% -   35%) 0.983
              BrowseMonthTaxoFacets        4.13     (34.7%)        4.13     (34.7%)    0.1% ( -51% -  106%) 0.991
                             Fuzzy1       65.72      (1.1%)       65.81      (1.2%)    0.1% (  -2% -    2%) 0.714
                            LowTerm      513.23      (2.9%)      513.99      (2.9%)    0.1% (  -5% -    6%) 0.870
               HighIntervalsOrdered        1.62      (5.3%)        1.62      (5.0%)    0.2% (  -9% -   11%) 0.909
          BrowseDayOfYearSSDVFacets        3.75     (12.4%)        3.76      (8.6%)    0.2% ( -18% -   24%) 0.956
                            MedTerm      382.67      (3.7%)      383.49      (3.9%)    0.2% (  -7% -    8%) 0.857
                    LowSloppyPhrase       12.31      (2.4%)       12.34      (2.1%)    0.2% (  -4% -    4%) 0.754
                            Respell       36.92      (1.6%)       37.02      (1.6%)    0.3% (  -2% -    3%) 0.617
                    MedSloppyPhrase       22.87      (3.2%)       22.94      (3.0%)    0.3% (  -5% -    6%) 0.751
                           HighTerm      251.80      (4.3%)      252.92      (4.6%)    0.4% (  -8% -    9%) 0.752
           AndHighHighDayTaxoFacets        7.39      (2.7%)        7.43      (2.6%)    0.5% (  -4% -    5%) 0.571
                             Fuzzy2       39.80      (1.4%)       40.02      (1.8%)    0.5% (  -2% -    3%) 0.288
                   HighSloppyPhrase        1.88      (4.3%)        1.89      (4.2%)    0.6% (  -7% -    9%) 0.657
                           Wildcard       39.56      (3.9%)       39.82      (2.7%)    0.7% (  -5% -    7%) 0.537
                            Prefix3       77.73      (6.4%)       78.39      (5.8%)    0.8% ( -10% -   13%) 0.663
                  HighTermMonthSort     2196.21      (3.5%)     2217.20      (3.5%)    1.0% (  -5% -    8%) 0.389
              HighTermDayOfYearSort      200.38      (3.6%)      203.21      (3.2%)    1.4% (  -5% -    8%) 0.186
              BrowseMonthSSDVFacets        4.46     (13.1%)        4.53      (9.4%)    1.5% ( -18% -   27%) 0.687
        BrowseRandomLabelSSDVFacets        2.68      (7.7%)        2.73      (4.5%)    1.6% (  -9% -   14%) 0.421
                             IntNRQ       17.08      (4.8%)       17.41      (7.4%)    1.9% (  -9% -   14%) 0.335
                         TermDTSort      102.14      (2.9%)      104.31      (5.5%)    2.1% (  -6% -   10%) 0.127
                  HighTermTitleSort       38.07     (11.1%)       39.53      (9.0%)    3.8% ( -14% -   26%) 0.228
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller merged PR #12427:
URL: https://github.com/apache/lucene/pull/12427


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789348091

   @benwtrent glad it was a simple fix. Sorry it created churn!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gautamworah96 commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "gautamworah96 (via GitHub)" <gi...@apache.org>.

gautamworah96 commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1631629455

   I was looking at this problem too and the change in test classes to meld arbitrary collections to `Lists` irked me as well. 
    
   +1 to throw an `IllegalArgumentException` in `#add` instead of having the user debug the error later on. 
   
   On a related note, since we only traverse the items in their given order, maybe we can relax `StringsToAutomaton#build(Collection<BytesRef>, boolean)` to `StringsToAutomaton#build(Iterable<BytesRef> input, boolean)`. This change will also make it more consistent with the `build(BytesRefIterator input, boolean asBinary)` method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633502769

   On the same note, since both the methods expects `Iterable` or `Iterators` why do we even need 2 separate methods here which are doing exactly the same thing i.e. iterating over the `ByteRef`'s and adding to automaton. 
   
   It would have been much better  if `BytesRefIterator` implemented the `Iterable`, `Iterator` interfaces in which case we could just have one method `StringsToAutomaton#build(Iterable)` which takes an `Iterable`. I don't see why do we even have a separate `BytesRefIterator` interface and not just using `Iterator` instead(maybe because its legacy code?) or is it that I'm missing something important here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "dweiss (via GitHub)" <gi...@apache.org>.

dweiss commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789519097

   See #12742 - it's a problem with out javac task settings. Running a clean is a workaround. I will provide a fix for this when I get a chance, it's not a critical issue (but it is a bug in how we currently set up task dependencies).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.

shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633489589

   > This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?
   
   Agreed @gsmiller! Yes it does seem maybe its better leave this as is.
   
   > we could look at changing the assert on line 276 of StringsToAutomaton to throw an explicit IllegalArgumentException so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There would add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change?
   
   I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.
   
   > maybe we can relax StringsToAutomaton#build(Collection<BytesRef>, boolean) to StringsToAutomaton#build(Iterable<BytesRef> input, boolean). This change will also make it more consistent with the build(BytesRefIterator input, boolean asBinary) method.
   
   Good point @gautamworah96. `Iterable` seems like a better way instead of `Collection`  and would indeed be consistent with other method. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1644602539

> I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.

That's right. I imagine a lot of users disable asserts when running in a production environment for performance reasons, so the only concern here is that we're adding some new work for them to do this validation. I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.

As for `Iterable` vs. `Collection`, I agree that `Iterable` is more appropriate for the API. Whether-or-not it's worth making that change though in a back-compat way is maybe a different question?

> I don't see why do we even have a separate BytesRefIterator interface and not just using Iterator instead(maybe because its legacy code?) or is it that I'm missing something important here?

I think it's so that `next` can throw `IOException`? You can't do that with `Iterator<BytesRef>`.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789264349

   @benwtrent is this an issue of the code only being partially rebuilt? The signature for `Automata#makeStringUnion` was changed to accept a more general `Iterable` as opposed to a `Collection` in this change. Let me see if I can reproduce it locally as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org