You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2021/05/29 00:11:00 UTC
[jira] [Commented] (LUCENE-9981) CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit

    [ https://issues.apache.org/jira/browse/LUCENE-9981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353637#comment-17353637 ] 

Robert Muir commented on LUCENE-9981:
-------------------------------------

Before we {{det(reverse())}} this automaton to compute the common suffix for the optimization, note that it has:
* 52000 states
* 84000 transitions

I think we should just not even bother to do this optimization if the automaton is "big" (say more than 1,000 states or more than 1,000 transitions). 

Something like this is a 1-line change:
{noformat}
--- a/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
+++ b/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
@@ -237,7 +237,7 @@ public class CompiledAutomaton implements Accountable {
       binary = new UTF32ToUTF8().convert(automaton);
     }

-    if (this.finite) {
+    if (this.finite || binary.getNumStates() > 1000 || binary.getNumTransitions() > 1000) {
       commonSuffixRef = null;
     } else {
       // NOTE: this is a very costly operation!  We should test if it's really warranted in
{noformat}

This means the test passes in 5 seconds instead of failing with exception after 268 seconds. I think its a good general fix for this specific method.

But still, I think there might be more interesting problems here: 5 seconds still sucks, especially on a 0-document index. And you can be sure I don't want to add a unit test taking 5 seconds of cpu time :)



> CompiledAutomaton.getCommonSuffix can be extraordinarily slow, even with default maxDeterminizedStates limit
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9981
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9981
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: LUCENE-9981_test.patch
>
>
> We have a {{maxDeterminizedStates = 10000}} limit designed to keep regexp-type queries from blowing up. 
> But we have an adversary that will run for 268s on my laptop before hitting exception, first reported here: https://github.com/opensearch-project/OpenSearch/issues/687
> When I run the test and jstack the threads, this what I see:
> {noformat}
> "TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x00007fff7006ca80 nid=0x231c8 runnable  [0x00007fff8b7f0000]
>    java.lang.Thread.State: RUNNABLE
> 	at org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
> 	at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
> 	at org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
> 	at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
> 	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
> 	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
> 	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
> 	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
> 	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
> 	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
> 	at org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)
> {noformat}
> This is really sad, as {{getCommonSuffixBytesRef()}} is only supposed to be an "up-front" optimization to make the actual subsequent terms-intensive part of the query faster. But it makes the whole query run for nearly 5 minutes before it does anything.
> So I definitely think we should improve {{getCommonSuffixBytesRef}} to be more "best-effort". For example, it can reduce the lower bound to {{1000}} and catch the exception like such:
> {code}
> try {
>    // this is slow, and just an opto anyway, so don't burn cycles on it for some crazy worst-case.
>    // if we don't set this common suffix, the query will just run a bit slower, that's all.
>    int limit = Math.min(1000, maxDeterminizedStates);
>    BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
>    ... (setting commonSuffixRef)
> } catch (TooComplexTooDeterminizeException notWorthIt) {
>   commonSuffixRef = null;
> }
> {code}
> Another, maybe simpler option, is to just check that input state/transitions accounts don't exceed some low limit N.
> Basically this opto is geared at stuff like leading wildcard query of "*foo". By computing that the common suffix is "foo" we can spend less CPU in the terms dictionary because we can first do a memcmp before having to run data thru any finite state machine. It's really a microopt and we shouldn't be spending whole seconds of cpu on it, ever.
> But I still don't quite understand how the current limits are giving the behavior today, maybe there is a bigger issue and I don't want to shove something under the rug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org