You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "shubhamvishu (via GitHub)" <gi...@apache.org> on 2023/07/09 20:26:30 UTC
[GitHub] [lucene] shubhamvishu opened a new pull request, #12427: StringsToAutomaton#build to take List as parameter instead of Collection
shubhamvishu opened a new pull request, #12427:
URL: https://github.com/apache/lucene/pull/12427
### Description
- This addresses issue #12319
- `StringsToAutomaton` previously called `DaciukMihovAutomatonBuilder`(renamed in this PR) should take List as input instead of Collection as it expects sorted input.
Closes #12319
<!--
If this is your first contribution to Lucene, please make sure you have reviewed the contribution guide.
https://github.com/apache/lucene/blob/main/CONTRIBUTING.md
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "benwtrent (via GitHub)" <gi...@apache.org>.
benwtrent commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789218896
Something weird is happening. This commit is causing the following failure:
```
./gradlew test --tests TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource -Dtests.seed=9E4091CE5D400BA3
```
I am starting to investigate, but folks involved with this change might find the issue quicker.
```
> java.lang.NoSuchMethodError: 'org.apache.lucene.util.automaton.Automaton org.apache.lucene.util.automaton.Automata.makeStringUnion(java.util.Collection)'
> at org.apache.lucene.search.uhighlight.CharArrayMatcher.fromTerms(CharArrayMatcher.java:42)
> at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.buildCombinedAutomaton(MemoryIndexOffsetStrategy.java:71)
> at org.apache.lucene.search.uhighlight.MemoryIndexOffsetStrategy.<init>(MemoryIndexOffsetStrategy.java:53)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getOffsetStrategy(UnifiedHighlighter.java:1238)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.getFieldHighlighter(UnifiedHighlighter.java:1072)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:880)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:814)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:792)
> at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:725)
> at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:607)
> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:1583)
>
> java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 7 open files: {_0_Lucene90_0.pos=1, _0_Lucene90_0.doc=1, _0_Lucene90_0.tim=1, _0.nvd=1, _0.fdx=1, _0.fdt=1, _0_Lucene90_0.tip=1}
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876)
> at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.doAfter(TestUnifiedHighlighterTermIntervals.java:84)
> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:1004)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
> at java.base/java.lang.Thread.run(Thread.java:1583)
>
> Caused by:
> java.lang.RuntimeException: unclosed IndexInput: _0.nvd
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsProducer.<init>(Lucene90NormsProducer.java:84)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene90.Lucene90NormsFormat.normsProducer(Lucene90NormsFormat.java:96)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.codecs.asserting.AssertingNormsFormat.normsProducer(AssertingNormsFormat.java:46)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:108)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:180)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:222)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.lambda$getReader$0(IndexWriter.java:539)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:138)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:601)
> at org.apache.lucene.core@10.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter$4.getReader(IndexWriter.java:6510)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:508)
> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.index.RandomIndexWriter.getReader(RandomIndexWriter.java:426)
> at org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermIntervals.testCustomFieldValueSource(TestUnifiedHighlighterTermIntervals.java:581)
> at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
> at java.base/java.lang.reflect.Method.invoke(Method.java:580)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
> at randomizedtesting.runner@2.8.1/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1790129043
Nice!....I like it how we discovered a build issue bug(which was hidden for sometime now I think) and fixed it. Thanks for solving @dweiss!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1784156071
> Below are the results(looks all good to me). Let me know what do you think? Thanks!
+1 -- looks like just noise to me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "benwtrent (via GitHub)" <gi...@apache.org>.
benwtrent commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789266991
@gsmiller you are 100% correct 🤦
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1631550554
Yeah, good points/questions! I'd be curious how much overhead it would actually add to sort the input when it's already sorted?
But to take a step back for a moment, we also have automata building methods that accept a `BytesRefIterator` as well, which also must be sorted. This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?
If we want to make these functions a bit more user-friendly, we could look at changing the `assert` on line 276 of `StringsToAutomaton` to throw an explicit `IllegalArgumentException` so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There _would_ add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change? I _do_ think it's best to explicitly let the user know they passed invalid input in a case like this, so it would be a nice change if it didn't introduce a significant performance drag.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1785924607
+1 looks good to me as well. I like that this small change, 1) makes the API a little more general, allowing users to provide any Iterable instead of Collection, and 2) adds an explicit check for expected ordering of the iterable using IAE instead of relying on asserts being enabled. Thanks for following up!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1630241517
@gsmiller I totally agree its not helpful if the input/data `Collection` is sorted since its a unnecessary overhead to convert to a `List` here. As per `TermInSetQuery` ctor its checking if the `collection` is of `SortedSet` type and uses natural ordering. But what if the user is explicitly passing a List (or any `Collection` for that matter) which is already sorted but is not `SortedSet`?
Could we expose another `#build` function as `build(Collection<BytesRef> input, boolean asBinary, boolean isSorted)` or for better maybe `makeStringUnion(Collection<BytesRef> utf8Strings, boolean isSorted)` (similar to the one in tests in this PR) to completely bypass the sorting/checks if incase `isSorted == true` else we could have the same check as in `TermInSetquery` ctor and sort the terms incase they are not before moving on in `Automata#makeStringUnion` as you mentioned? Let me know your thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1629757874
Thanks @shubhamvishu for working on this! As I look at the PR, I'm wondering if accepting a `List` is really the proper thing to do here. If users already have a sorted `Collection` (like `SortedSet`), are we forcing them to make unnecessary copies of their data to adhere to our API?
I wonder if the ctor for `TermInSetQuery` might serve as a reasonable example of handling sortedness? `TermInSetQuery` accepts a `Collection` but tries to see if it's already sorted (and de-duped). If it is, it uses it as-is. If not, it sorts and de-dupes on behalf of the user. The automaton building logic right now only has a simple `assert` down in the internals, which could be disabled depending on the runtime environment, resulting in tricky debugging if the user didn't handle sorting their data properly.
Since `StringsToAutomaton` is an internal class, maybe one approach would be to leave it untouched, letting it expect pre-sorted data and having a simple `assert` statement as it does today to check that (for easier debugging), but add sorting logic to `Automata#makeStringUnion`? What if that method accepted a `Collection` as it does today but had similar logic as the `TermInSetQuery` ctor related to sorting?
Apologies for leading us in the wrong direction in the issue I opened. I think a `List` is a bit more semantically appropriate than a `Collection`, but I'm not sure it's best from a pragmatic standpoint? What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1673498218
> I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.
Agreed! As we discussed above one way could be to sort based on user input that if the collection is sorted or not but since that will be for only one flavour it might be confusing. I'll try to benchmark this change to see if its causing any regression or not.
> I think it's so that next can throw IOException? You can't do that with Iterator<BytesRef>.
Ahh I see. That makes sense. Thanks @gsmiller !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1782977506
@gsmiller I finally got a chance to run the benchmarks for this change. Below are the results(looks all good to me). Let me know what do you think? Thanks!
```
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
AndHighMed 96.52 (6.5%) 95.49 (6.3%) -1.1% ( -12% - 12%) 0.597
LowSpanNear 133.61 (4.7%) 132.45 (4.0%) -0.9% ( -9% - 8%) 0.528
OrHighMed 67.95 (3.2%) 67.37 (4.6%) -0.9% ( -8% - 7%) 0.492
MedSpanNear 5.79 (5.2%) 5.74 (4.2%) -0.8% ( -9% - 8%) 0.573
OrNotHighLow 322.30 (2.3%) 319.70 (2.0%) -0.8% ( -4% - 3%) 0.233
BrowseDateSSDVFacets 0.90 (8.2%) 0.89 (9.4%) -0.8% ( -17% - 18%) 0.773
AndHighLow 355.80 (4.2%) 352.99 (4.1%) -0.8% ( -8% - 7%) 0.550
MedTermDayTaxoFacets 5.85 (6.2%) 5.81 (5.7%) -0.7% ( -11% - 11%) 0.713
OrNotHighMed 222.70 (3.6%) 221.51 (4.0%) -0.5% ( -7% - 7%) 0.657
OrHighNotMed 194.41 (3.6%) 193.40 (3.4%) -0.5% ( -7% - 6%) 0.641
MedPhrase 44.71 (3.1%) 44.50 (3.4%) -0.5% ( -6% - 6%) 0.639
OrNotHighHigh 179.60 (3.7%) 178.85 (3.8%) -0.4% ( -7% - 7%) 0.726
HighSpanNear 8.33 (4.6%) 8.30 (4.1%) -0.4% ( -8% - 8%) 0.765
OrHighNotLow 331.53 (3.8%) 330.15 (3.7%) -0.4% ( -7% - 7%) 0.728
OrHighNotHigh 248.18 (3.5%) 247.20 (3.4%) -0.4% ( -7% - 6%) 0.718
MedIntervalsOrdered 4.66 (4.6%) 4.64 (4.3%) -0.4% ( -8% - 8%) 0.780
OrHighMedDayTaxoFacets 4.39 (3.9%) 4.38 (4.2%) -0.3% ( -8% - 8%) 0.820
OrHighLow 569.67 (2.6%) 568.04 (2.6%) -0.3% ( -5% - 5%) 0.733
LowIntervalsOrdered 2.64 (4.6%) 2.64 (4.6%) -0.2% ( -8% - 9%) 0.899
HighTermTitleBDVSort 5.20 (4.0%) 5.19 (4.1%) -0.2% ( -7% - 8%) 0.895
OrHighHigh 35.47 (4.4%) 35.42 (5.8%) -0.1% ( -9% - 10%) 0.929
BrowseDayOfYearTaxoFacets 3.80 (15.9%) 3.80 (16.0%) -0.1% ( -27% - 37%) 0.984
BrowseDateTaxoFacets 3.78 (15.5%) 3.78 (15.5%) -0.0% ( -26% - 36%) 0.995
AndHighHigh 16.35 (3.6%) 16.35 (7.5%) 0.0% ( -10% - 11%) 0.996
AndHighMedDayTaxoFacets 49.40 (1.6%) 49.41 (1.7%) 0.0% ( -3% - 3%) 0.975
LowPhrase 254.05 (3.6%) 254.16 (4.0%) 0.0% ( -7% - 7%) 0.972
HighPhrase 40.53 (3.1%) 40.56 (3.6%) 0.1% ( -6% - 6%) 0.939
PKLookup 131.32 (2.3%) 131.44 (2.0%) 0.1% ( -4% - 4%) 0.891
BrowseRandomLabelTaxoFacets 3.28 (15.0%) 3.28 (14.8%) 0.1% ( -25% - 35%) 0.983
BrowseMonthTaxoFacets 4.13 (34.7%) 4.13 (34.7%) 0.1% ( -51% - 106%) 0.991
Fuzzy1 65.72 (1.1%) 65.81 (1.2%) 0.1% ( -2% - 2%) 0.714
LowTerm 513.23 (2.9%) 513.99 (2.9%) 0.1% ( -5% - 6%) 0.870
HighIntervalsOrdered 1.62 (5.3%) 1.62 (5.0%) 0.2% ( -9% - 11%) 0.909
BrowseDayOfYearSSDVFacets 3.75 (12.4%) 3.76 (8.6%) 0.2% ( -18% - 24%) 0.956
MedTerm 382.67 (3.7%) 383.49 (3.9%) 0.2% ( -7% - 8%) 0.857
LowSloppyPhrase 12.31 (2.4%) 12.34 (2.1%) 0.2% ( -4% - 4%) 0.754
Respell 36.92 (1.6%) 37.02 (1.6%) 0.3% ( -2% - 3%) 0.617
MedSloppyPhrase 22.87 (3.2%) 22.94 (3.0%) 0.3% ( -5% - 6%) 0.751
HighTerm 251.80 (4.3%) 252.92 (4.6%) 0.4% ( -8% - 9%) 0.752
AndHighHighDayTaxoFacets 7.39 (2.7%) 7.43 (2.6%) 0.5% ( -4% - 5%) 0.571
Fuzzy2 39.80 (1.4%) 40.02 (1.8%) 0.5% ( -2% - 3%) 0.288
HighSloppyPhrase 1.88 (4.3%) 1.89 (4.2%) 0.6% ( -7% - 9%) 0.657
Wildcard 39.56 (3.9%) 39.82 (2.7%) 0.7% ( -5% - 7%) 0.537
Prefix3 77.73 (6.4%) 78.39 (5.8%) 0.8% ( -10% - 13%) 0.663
HighTermMonthSort 2196.21 (3.5%) 2217.20 (3.5%) 1.0% ( -5% - 8%) 0.389
HighTermDayOfYearSort 200.38 (3.6%) 203.21 (3.2%) 1.4% ( -5% - 8%) 0.186
BrowseMonthSSDVFacets 4.46 (13.1%) 4.53 (9.4%) 1.5% ( -18% - 27%) 0.687
BrowseRandomLabelSSDVFacets 2.68 (7.7%) 2.73 (4.5%) 1.6% ( -9% - 14%) 0.421
IntNRQ 17.08 (4.8%) 17.41 (7.4%) 1.9% ( -9% - 14%) 0.335
TermDTSort 102.14 (2.9%) 104.31 (5.5%) 2.1% ( -6% - 10%) 0.127
HighTermTitleSort 38.07 (11.1%) 39.53 (9.0%) 3.8% ( -14% - 26%) 0.228
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller merged PR #12427:
URL: https://github.com/apache/lucene/pull/12427
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789348091
@benwtrent glad it was a simple fix. Sorry it created churn!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "gautamworah96 (via GitHub)" <gi...@apache.org>.
gautamworah96 commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1631629455
I was looking at this problem too and the change in test classes to meld arbitrary collections to `Lists` irked me as well.
+1 to throw an `IllegalArgumentException` in `#add` instead of having the user debug the error later on.
On a related note, since we only traverse the items in their given order, maybe we can relax `StringsToAutomaton#build(Collection<BytesRef>, boolean)` to `StringsToAutomaton#build(Iterable<BytesRef> input, boolean)`. This change will also make it more consistent with the `build(BytesRefIterator input, boolean asBinary)` method.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633502769
On the same note, since both the methods expects `Iterable` or `Iterators` why do we even need 2 separate methods here which are doing exactly the same thing i.e. iterating over the `ByteRef`'s and adding to automaton.
It would have been much better if `BytesRefIterator` implemented the `Iterable`, `Iterator` interfaces in which case we could just have one method `StringsToAutomaton#build(Iterable)` which takes an `Iterable`. I don't see why do we even have a separate `BytesRefIterator` interface and not just using `Iterator` instead(maybe because its legacy code?) or is it that I'm missing something important here?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "dweiss (via GitHub)" <gi...@apache.org>.
dweiss commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789519097
See #12742 - it's a problem with out javac task settings. Running a clean is a workaround. I will provide a fix for this when I get a chance, it's not a critical issue (but it is a bug in how we currently set up task dependencies).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] shubhamvishu commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "shubhamvishu (via GitHub)" <gi...@apache.org>.
shubhamvishu commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1633489589
> This is a situation where we really cannot sort on behalf of the caller, so it might be a bit confusing/trappy to sort some flavors of this method but not others? Maybe it's best to leave these methods as they are?
Agreed @gsmiller! Yes it does seem maybe its better leave this as is.
> we could look at changing the assert on line 276 of StringsToAutomaton to throw an explicit IllegalArgumentException so that we don't silently built a corrupt automaton on unordered input (with asserts disabled). There would add overhead since we have to now keep track of the previous term all the time, but maybe it's worth benchmarking and considering this change?
I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.
> maybe we can relax StringsToAutomaton#build(Collection<BytesRef>, boolean) to StringsToAutomaton#build(Iterable<BytesRef> input, boolean). This change will also make it more consistent with the build(BytesRefIterator input, boolean asBinary) method.
Good point @gautamworah96. `Iterable` seems like a better way instead of `Collection` and would indeed be consistent with other method.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #12427: StringsToAutomaton#build to take List as parameter instead of Collection
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1644602539
> I like the idea to throw IAE if wrong input is provided. I think this would only affect the cases with assert disabled? otherwise with asserts enabled we anyways always keep track of previous term.
That's right. I imagine a lot of users disable asserts when running in a production environment for performance reasons, so the only concern here is that we're adding some new work for them to do this validation. I still think the explicit validation is better in this situation, but the counter-argument would be a user that knows for sure they're always properly providing sorted input and don't want the overhead of the validation.
As for `Iterable` vs. `Collection`, I agree that `Iterable` is more appropriate for the API. Whether-or-not it's worth making that change though in a back-compat way is maybe a different question?
> I don't see why do we even have a separate BytesRefIterator interface and not just using Iterator instead(maybe because its legacy code?) or is it that I'm missing something important here?
I think it's so that `next` can throw `IOException`? You can't do that with `Iterator<BytesRef>`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org
Re: [PR] StringsToAutomaton#build to take List as parameter instead of Collection [lucene]
Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12427:
URL: https://github.com/apache/lucene/pull/12427#issuecomment-1789264349
@benwtrent is this an issue of the code only being partially rebuilt? The signature for `Automata#makeStringUnion` was changed to accept a more general `Iterable` as opposed to a `Collection` in this change. Let me see if I can reproduce it locally as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org