
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/07/26 21:29:17 UTC

[GitHub] [lucene] zhaih opened a new pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

zhaih opened a new pull request #225:
URL: https://github.com/apache/lucene/pull/225


# Description

https://issues.apache.org/jira/browse/LUCENE-10010

Introduces `NFARunAutomaton`, which runs an NFA directly.

Work to do:
1. Integrate with the current `RunAutomaton` class hierarchy
2. Further optimize the `NFARunAutomaton` implementation

# Tests

A unit test asserts that `NFARunAutomaton` behaves the same as the DFA-based one, using randomly generated regex strings.
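The idea behind such a test can be sketched without Lucene. The block below is a minimal, self-contained illustration under stated assumptions: `NfaVsDfaSketch`, its two-symbol alphabet, and all method names are hypothetical and are not the PR's classes. It runs a random NFA directly by tracking the set of active states, builds a DFA from the same NFA with eager subset construction, and checks that both give the same answer on random inputs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical illustration only; not Lucene's NFARunAutomaton or its test. */
final class NfaVsDfaSketch {
  static final int ALPHABET = 2; // symbols are 0 and 1
  final BitSet[][] next; // next[state][symbol] = set of NFA target states (no epsilon moves)
  final BitSet accept; // accepting NFA states; state 0 is the start state

  NfaVsDfaSketch(BitSet[][] next, BitSet accept) {
    this.next = next;
    this.accept = accept;
  }

  /** Union of the targets of every active state on the given symbol. */
  BitSet step(BitSet active, int symbol) {
    BitSet out = new BitSet();
    for (int s = active.nextSetBit(0); s >= 0; s = active.nextSetBit(s + 1)) {
      out.or(next[s][symbol]);
    }
    return out;
  }

  /** Runs the NFA directly by tracking the set of active states. */
  boolean runNfa(int[] input) {
    BitSet active = new BitSet();
    active.set(0);
    for (int symbol : input) {
      active = step(active, symbol);
    }
    return active.intersects(accept);
  }

  /** Eager subset construction: row i of the result is DFA state i; dfaAccept is filled in. */
  List<int[]> determinize(BitSet dfaAccept) {
    List<int[]> table = new ArrayList<>();
    List<BitSet> sets = new ArrayList<>();
    Map<BitSet, Integer> ids = new HashMap<>();
    BitSet start = new BitSet();
    start.set(0);
    ids.put(start, 0);
    sets.add(start);
    for (int id = 0; id < sets.size(); id++) { // sets grows while new subsets are discovered
      int[] row = new int[ALPHABET];
      if (sets.get(id).intersects(accept)) dfaAccept.set(id);
      for (int symbol = 0; symbol < ALPHABET; symbol++) {
        BitSet target = step(sets.get(id), symbol);
        Integer targetId = ids.putIfAbsent(target, sets.size());
        if (targetId == null) { // first time this subset is seen: it becomes a new DFA state
          targetId = sets.size();
          sets.add(target);
        }
        row[symbol] = targetId;
      }
      table.add(row);
    }
    return table;
  }

  static boolean runDfa(List<int[]> table, BitSet dfaAccept, int[] input) {
    int state = 0;
    for (int symbol : input) {
      state = table.get(state)[symbol];
    }
    return dfaAccept.get(state);
  }

  /** Builds random NFAs and checks that both runners agree on random inputs. */
  static boolean agree(long seed, int rounds) {
    Random random = new Random(seed);
    for (int round = 0; round < rounds; round++) {
      int states = 1 + random.nextInt(5);
      BitSet[][] next = new BitSet[states][ALPHABET];
      BitSet accept = new BitSet();
      for (int s = 0; s < states; s++) {
        for (int symbol = 0; symbol < ALPHABET; symbol++) {
          next[s][symbol] = new BitSet();
          for (int t = 0; t < states; t++) {
            if (random.nextInt(3) == 0) next[s][symbol].set(t);
          }
        }
        if (random.nextBoolean()) accept.set(s);
      }
      NfaVsDfaSketch nfa = new NfaVsDfaSketch(next, accept);
      BitSet dfaAccept = new BitSet();
      List<int[]> table = nfa.determinize(dfaAccept);
      for (int i = 0; i < 20; i++) {
        int[] input = random.ints(random.nextInt(8), 0, ALPHABET).toArray();
        if (nfa.runNfa(input) != runDfa(table, dfaAccept, input)) return false;
      }
    }
    return true;
  }
}
```

Calling `NfaVsDfaSketch.agree(42L, 100)` exercises 100 random automata. The actual test in this PR works on Lucene's `RegExp` and `Automaton` types instead, and mixes accepted strings with random ones.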

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941304810


   I re-ran the jobs. The stack trace from the error does look suspicious though.



[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-988451392


   OK, sorry, I realize the latest two PR split-outs still don't solve your problem. The API is still ugly because we still `minimize()` regexps, and that implies `determinize()`... but maybe that isn't obvious.
   
   I don't think we need to `minimize()` any more, I opened a separate discussion: https://issues.apache.org/jira/browse/LUCENE-10296. I will try to prototype a PR.
   
   Instead, the decisions on this issue should just be about DFA or NFA! 



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r761533567



##########
File path: lucene/CHANGES.txt
##########
@@ -7,6 +7,8 @@ http://s.apache.org/luceneversions
 
 New Features
 
+* LUCENE-10010 Introduce NFARunAutomaton to run NFA directly. (Haoyu Zhai)

Review comment:
       Ah nice catch!





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724697716



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       I'm hesitant to do that, since some methods might still rely on `runAutomaton` being of type `RunAutomaton` rather than using only the methods abstracted by `ByteRunnable`.
   
   Also, keeping them separate might make it a bit easier to spot problems in `nfaRunAutomaton` at this stage. Perhaps we should merge them once the NFA implementation is more mature?





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724695030



##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -65,7 +66,19 @@
    * @param automaton Automaton to run, terms that are accepted are considered a match.
    */
   public AutomatonQuery(final Term term, Automaton automaton) {
-    this(term, automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
+    this(term, automaton, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}. Using specific type of RunAutomaton
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param runnableType NFA or DFA

Review comment:
       Improved! (I basically pointed to the previous improved javadoc lol)





[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941199775


   Hmmm, one of the tests failed:
   ```
   ERROR: The following test(s) have failed:
   > Task :lucene:analysis:smartcn:test
     - org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef (:lucene:core)
   :lucene:analysis:smartcn:test (SUCCESS): 21 test(s)
       Test output: /home/runner/work/lucene/lucene/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexFileDeleter.txt
   
       Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef" -Ptests.jvms=1 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=FD14DA9475FFAE2C -Ptests.slow=false -Ptests.file.encoding=UTF-8
   ```
   But I can't reproduce it locally...



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724698226



##########
File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
##########
@@ -187,6 +187,14 @@ public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throw
     if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
       throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
     }
+    if (compiled.nfaRunAutomaton != null) {
+      return new IntersectTermsEnum(

Review comment:
       Sorry, could you expand a bit? I don't quite understand this comment.





[GitHub] [lucene] dweiss edited a comment on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

dweiss edited a comment on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941302842


   The stack trace is interesting though - looks like access to a closed reader pool:
   ```
   org.apache.lucene.index.TestIndexFileDeleter > testExcInDecRef FAILED
       org.apache.lucene.store.AlreadyClosedException: ReaderPool is already closed
           at __randomizedtesting.SeedInfo.seed([FD14DA9475FFAE2C:1489ADA6033649D1]:0)
           at app//org.apache.lucene.index.ReaderPool.get(ReaderPool.java:400)
           at app//org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3742)
           at app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:590)
           at app//org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:474)
           at app//org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:406)
           at app//org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef(TestIndexFileDeleter.java:484)
   ```



[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-985972848


   OK, see https://github.com/apache/lucene/pull/513, which is another PR like #485, just for the `RegExp` class. All the trappy minimization is removed; it may return a DFA or an NFA, and it is the caller's choice to minimize. Combined with #485, I think it gives us the opportunity for a simple API, e.g. this stuff would just happen in one place (e.g. the `RegexpQuery` class) rather than being strewn all about.



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r772610342



##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       Oh makes sense, tricky! Thank you!





[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r762372861



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       Similar to the other classes (AutomatonQuery, RunAutomaton, CompiledAutomaton), I don't think we should add booleans here.
   
   Instead, I'd rather remove `determinizeWorkLimit` and the calls to `minimize()` everywhere from this thing. It is pretty crazy that it calls `minimize` at every "parsing step"!
   
   I think `toAutomaton()` should just return an NFA, and if the caller wants to determinize or minimize it, that's up to them.
   
   There is an annoying twist, in that even with all these booleans for an NFA, determinize() still gets called if you use the "complement" operator. I'm not sure there is a way to implement this operator without exponential time. It is documented to be optional (although we enable it by default in our RegexpQuery), maybe we should seriously consider removing this operator?
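As background for the "complement" remark: the textbook way to complement an automaton is to flip its accepting states, and that is only correct on a deterministic, complete automaton. The block below is a minimal sketch of why, using a hypothetical three-state NFA over the one-letter alphabet {a}. `ComplementSketch` and its tables are made up for illustration and do not use Lucene's `Automaton` or `Operations` classes; the sketch also says nothing about whether a cheaper NFA-only complement exists.

```java
import java.util.BitSet;

/** Hypothetical illustration only; it does not use Lucene's Automaton or Operations classes. */
final class ComplementSketch {
  // NFA over the one-letter alphabet {a}; state 0 is the start state.
  // 0 -a-> 1 and 0 -a-> 2, and only state 1 accepts, so the language is exactly {"a"}.
  static final int[][] NFA_TARGETS = {{1, 2}, {}, {}};

  // The same language after subset construction: {0} is DFA state 0, {1,2} is DFA state 1,
  // and the empty set is the dead DFA state 2. Only DFA state 1 accepts.
  static final int[] DFA_TARGET = {1, 2, 2};

  static BitSet states(int... ids) {
    BitSet set = new BitSet();
    for (int id : ids) set.set(id);
    return set;
  }

  static void requireA(char c) {
    if (c != 'a') throw new IllegalArgumentException("alphabet is {a}, got: " + c);
  }

  /** Runs the NFA with the given accepting states by tracking the set of active states. */
  static boolean nfaAccepts(String input, BitSet accept) {
    BitSet active = states(0);
    for (int i = 0; i < input.length(); i++) {
      requireA(input.charAt(i));
      BitSet nextActive = new BitSet();
      for (int s = active.nextSetBit(0); s >= 0; s = active.nextSetBit(s + 1)) {
        for (int t : NFA_TARGETS[s]) nextActive.set(t);
      }
      active = nextActive;
    }
    return active.intersects(accept);
  }

  /** Runs the determinized automaton with the given accepting states. */
  static boolean dfaAccepts(String input, BitSet accept) {
    int state = 0;
    for (int i = 0; i < input.length(); i++) {
      requireA(input.charAt(i));
      state = DFA_TARGET[state];
    }
    return accept.get(state);
  }

  public static void main(String[] args) {
    // Flipping accept states on the NFA ({1} becomes {0, 2}): "a" is still accepted via state 2.
    System.out.println(nfaAccepts("a", states(1)) + " " + nfaAccepts("a", states(0, 2)));
    // prints "true true"

    // Flipping accept states on the DFA ({1} becomes {0, 2}): "a" is rejected, "aa" is accepted.
    System.out.println(dfaAccepts("a", states(0, 2)) + " " + dfaAccepts("aa", states(0, 2)));
    // prints "false true"
  }
}
```

On the NFA, the flipped automaton still accepts "a" because one of the two nondeterministic paths ends in a state that was previously rejecting. After determinization each input follows exactly one path, so flipping the accepting states yields the true complement.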

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       I've got a good practical solution to this, PR is coming soon. Then RegExp gets simple and callers can determinize/minimize if they want that.

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       You can also annotate the `IllegalArgumentException` with `@SuppressWarnings("unused")`.
   
   As for the random regexp, I think it already does what you want? One tricky thing is that the strings it generates are only valid if you then build the regexp with `RegExp.NONE`. That is probably what causes the confusion here.

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       I too would really prefer it if we could avoid the current "if". It starts to look like a "dance" in all the places where we do it. I don't understand how it makes debugging easier; can't we just print `automaton.isDeterministic()`?
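One way to avoid the per-call-site "if" is for callers to depend only on a shared stepping interface. The block below is a rough sketch under stated assumptions: `Runner`, `DfaRunner`, `NfaRunner`, and `matches` are hypothetical names, and the real `ByteRunnable` in this PR may have a different shape. The NFA runner hands out an integer id for each set of NFA states it discovers, so it can sit behind the same interface as a table-driven DFA.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RunnerSketch {
  /** Hypothetical, simplified interface; the PR's ByteRunnable may look different. */
  interface Runner {
    int step(int state, int symbol); // returns -1 when there is no transition

    boolean isAccept(int state);
  }

  /** Table-driven DFA; state 0 is the start state. */
  static final class DfaRunner implements Runner {
    private final int[][] table;
    private final BitSet accept;

    DfaRunner(int[][] table, BitSet accept) {
      this.table = table;
      this.accept = accept;
    }

    public int step(int state, int symbol) {
      return table[state][symbol];
    }

    public boolean isAccept(int state) {
      return accept.get(state);
    }
  }

  /** NFA runner that assigns an int id to each set of NFA states, created on demand. */
  static final class NfaRunner implements Runner {
    private final int[][][] targets; // targets[nfaState][symbol] = NFA target states
    private final BitSet accept;
    private final Map<BitSet, Integer> ids = new HashMap<>();
    private final List<BitSet> sets = new ArrayList<>();

    NfaRunner(int[][][] targets, BitSet accept) {
      this.targets = targets;
      this.accept = accept;
      BitSet start = new BitSet();
      start.set(0);
      ids.put(start, 0);
      sets.add(start);
    }

    public int step(int state, int symbol) {
      BitSet out = new BitSet();
      BitSet current = sets.get(state);
      for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
        for (int t : targets[s][symbol]) out.set(t);
      }
      if (out.isEmpty()) return -1;
      Integer id = ids.putIfAbsent(out, sets.size());
      if (id == null) { // first time this state set is seen: remember it under a new id
        id = sets.size();
        sets.add(out);
      }
      return id;
    }

    public boolean isAccept(int state) {
      return sets.get(state).intersects(accept);
    }
  }

  /** Caller code: no branching on whether the automaton is an NFA or a DFA. */
  static boolean matches(Runner runner, int[] input) {
    int state = 0;
    for (int symbol : input) {
      state = runner.step(state, symbol);
      if (state == -1) return false;
    }
    return runner.isAccept(state);
  }

  /** Example language: strings over {0, 1} that end with 1, as a DFA. */
  static Runner endsWithOneDfa() {
    BitSet accept = new BitSet();
    accept.set(1);
    return new DfaRunner(new int[][] {{0, 1}, {0, 1}}, accept);
  }

  /** Same language as an NFA: state 0 loops on both symbols and may guess the final 1. */
  static Runner endsWithOneNfa() {
    BitSet accept = new BitSet();
    accept.set(1);
    return new NfaRunner(new int[][][] {{{0}, {0, 1}}, {{}, {}}}, accept);
  }
}
```

With this shape, a consumer such as `matches` takes whichever runner it is given, and the NFA-or-DFA decision is made once where the runner is constructed.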






[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724693503



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -551,12 +551,22 @@ static RegExp newLeafNode(
     return new RegExp(flags, kind, null, null, s, c, min, max, digits, from, to);
   }
 
+  /**
+   * Return an <code>Automaton</code> from this <code>RegExp</code> that will skip the determinize
+   * and minimize step
+   *
+   * @return {@link Automaton} most likely non-deterministic
+   */
+  public Automaton toNFA() {

Review comment:
       I didn't read it carefully, but I think it just uses a systematic construction, similar to what is described here: https://swtch.com/~rsc/regexp/regexp1.html





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724695583



##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+          ;
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(
+          newTextField(
+              FIELD, perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2), Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(directory);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+        ;
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(40) + 30;
+      while (perLoopReuse.size() < termNum) {
+        if (random().nextBoolean()) {
+          perLoopReuse.add(vocabList.get(random().nextInt(vocabList.size())));
+        } else {
+          perLoopReuse.add(foreignVocabList.get(random().nextInt(foreignVocabList.size())));
+        }
+      }
+      Automaton a = null;
+      for (String term : perLoopReuse) {
+        if (a == null) {
+          a = Automata.makeString(term);
+        } else {
+          a = Operations.union(a, Automata.makeString(term));
+        }
+      }
+      if (a.isDeterministic()) {
+        i--;
+        continue;
+      }
+      AutomatonQuery dfaQuery = new AutomatonQuery(new Term(FIELD), a);
+      AutomatonQuery nfaQuery = new AutomatonQuery(new Term(FIELD), a, ByteRunnable.TYPE.NFA);

Review comment:
       Good idea! Let's defer it to the next PR (I will probably introduce NFARegexQuery there).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r772607680



##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       It happens because they are inconsistent. Disabling/enabling syntax with the `ALL` flag can cause exceptions that would not happen with `NONE`. For example:
   
   Valid with `RegExp.NONE`, but not valid with `RegExp.ALL` (because the flag enables numeric intervals, and it throws an exception as this is an incomplete interval):
   ```
   <1-
   ```





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r726751739



##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -65,7 +66,20 @@
    * @param automaton Automaton to run, terms that are accepted are considered a match.
    */

Review comment:
       Done





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r719908188



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -250,15 +291,23 @@ public CompiledAutomaton(
       }
     }
 
-    // This will determinize the binary automaton for us:
-    runAutomaton = new ByteRunAutomaton(binary, true, determinizeWorkLimit);
+    if (automaton.isDeterministic() == false && byteRunnableType == ByteRunnable.TYPE.NFA) {

Review comment:
       Yes!





[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r718577915



##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -65,7 +66,19 @@
    * @param automaton Automaton to run, terms that are accepted are considered a match.
    */
   public AutomatonQuery(final Term term, Automaton automaton) {
-    this(term, automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
+    this(term, automaton, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}. Using specific type of RunAutomaton
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param runnableType NFA or DFA

Review comment:
       Could you improve these javadocs a bit?  And specifically include a warning that `NFA` has uncertain performance impact?





[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r725023003



##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -96,12 +110,36 @@ public AutomatonQuery(final Term term, Automaton automaton, int determinizeWorkL
    */
   public AutomatonQuery(
       final Term term, Automaton automaton, int determinizeWorkLimit, boolean isBinary) {
+    this(term, automaton, determinizeWorkLimit, isBinary, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}.
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param determinizeWorkLimit maximum effort to spend determinizing the automaton. If the
+   *     automaton will need more than this much effort, TooComplexToDeterminizeException is thrown.
+   *     Higher numbers require more space but can process more complex automata.
+   * @param isBinary if true, this automaton is already binary and will not go through the
+   *     UTF32ToUTF8 conversion
+   * @param runnableType NFA or DFA. See {@link org.apache.lucene.util.automaton.ByteRunnable.TYPE}
+   *     for difference between NFA and DFA. Also note * that NFA has uncertain performance impact

Review comment:
       Remove that errant `*` between `note` and `that`?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {
+    /**
+     * Determinize the automaton lazily on-demand as terms are intersected. This option saves the
+     * up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will
+     * be a bit slower

Review comment:
       Missing period at the end of the sentence?
   
   Maybe link to Russ Cox's famous page (https://swtch.com/~rsc/regexp/regexp1.html) and point out that this is similar to the Thompson NFA approach described there?
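
To make the lazy, on-demand determinization discussed in this comment concrete, here is a minimal sketch in plain Java. It is a toy and not Lucene's `NFARunAutomaton`: the class `LazyNfaRunner`, its methods, and the char-keyed transition maps are all assumptions made for illustration. It tracks the set of live NFA states, in the spirit of the Thompson approach, and caches each (state set, character) step only when some input first needs it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy NFA runner (hypothetical, not Lucene's NFARunAutomaton). State 0 is the initial state.
final class LazyNfaRunner {
  // transitions.get(s) maps an input char to the set of NFA states reachable from s.
  private final List<Map<Character, BitSet>> transitions = new ArrayList<>();
  private final BitSet acceptStates = new BitSet();
  // Lazily built DFA: state set -> (char -> next state set). Keys are never mutated.
  private final Map<BitSet, Map<Character, BitSet>> cache = new HashMap<>();

  int addState(boolean accept) {
    transitions.add(new HashMap<>());
    int id = transitions.size() - 1;
    if (accept) {
      acceptStates.set(id);
    }
    return id;
  }

  // Add all transitions before the first call to accepts(); the cache is not invalidated.
  void addTransition(int from, char c, int to) {
    transitions.get(from).computeIfAbsent(c, k -> new BitSet()).set(to);
  }

  // Returns the next state set, computing and caching it only on first use.
  private BitSet step(BitSet current, char c) {
    Map<Character, BitSet> row = cache.computeIfAbsent(current, k -> new HashMap<>());
    BitSet next = row.get(c);
    if (next == null) {
      next = new BitSet();
      for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
        BitSet targets = transitions.get(s).get(c);
        if (targets != null) {
          next.or(targets);
        }
      }
      row.put(c, next);
    }
    return next;
  }

  boolean accepts(String input) {
    BitSet current = new BitSet();
    current.set(0);
    // Stop early once no NFA state is live.
    for (int i = 0; i < input.length() && !current.isEmpty(); i++) {
      current = step(current, input.charAt(i));
    }
    return current.intersects(acceptStates);
  }

  // Number of state sets that have been determinized so far.
  int cachedStateSets() {
    return cache.size();
  }
}
```

Only the state sets that the inputs actually reach end up in the cache, which is the property the `NFA` javadoc describes as saving the up-front determinize cost.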

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       OK we can wait on this.  Maybe just add a comment explaining why we need this odd `if` still, here and in the other places where we did this.

##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -65,7 +66,20 @@
    * @param automaton Automaton to run, terms that are accepted are considered a match.
    */

Review comment:
       Could you update these javadocs to state that the `runnableType` is `DFA` by default, and point to the `ByteRunnable.TYPE` javadocs?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {

Review comment:
       Could you add `@lucene.experimental` to the `TYPE` javadocs, and on each of the options (`NFA`, `DFA`)?
   
   Also, could you pull out this `enum` into its own class under `o.a.l.util.automaton`, maybe `AutomatonExecutionMode` or `AutomatonIntersectionMode` or `AutomatonExecutionStrategy` (getting longish to type...)?  I think it is too obscure living down inside this non-consumable (to Lucene users who don't know all sorts of details about automata) `ByteRunnable` now.

##########
File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
##########
@@ -187,6 +187,14 @@ public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throw
     if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
       throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
     }
+    if (compiled.nfaRunAutomaton != null) {
+      return new IntersectTermsEnum(

Review comment:
       Oh, never mind!  I got my classes confused :)  With this new `if` statement, we are in fact still using `BlockTree`'s fast intersect implementation, great!  Please disregard...





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r726747416



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       Sure





[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-981367730

> But since we are not changing the behavior of `AutomatonQuery` (it still determinizes up-front by default), I don't think we need to block pushing this (once we iterate on all feedback) on benchmark results?

Can we at least run existing benchmarks to confirm it doesn't introduce regressions? Some of the code is performance sensitive and we are adding abstractions here. damn java.

I'm still confused about the API, as it adds lots of booleans/enums. Do we really need enums, or can a simple boolean suffice? Does AutomatonQuery really need such a boolean, or should it just look at `Automaton.isDeterministic` to determine what to do? Depending on what performance shows, why even keep around both `determinizeWorkLimit` and the booleans/enums? Instead of throwing an exception, why not fall back to NFA if you really want to run some crazy regexp?


[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-988465898


   ok one more, but I think it sets us up even better: https://github.com/apache/lucene/pull/528



[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-998323463


   Thank you @rmuir and @mikemccand for reviewing this big PR! I'll merge it myself :)



[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r715531456



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/uniformsplit/IntersectBlockReader.java
##########
@@ -384,15 +390,18 @@ protected AutomatonNextTermCalculator(CompiledAutomaton compiled) {
     }
 
     /** Records the given state has been visited. */
-    protected void setVisited(int state) {
+    private void setVisited(int state) {
       if (!finite) {

Review comment:
       Maybe fix to `finite == false` since you are here :)

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -551,12 +551,22 @@ static RegExp newLeafNode(
     return new RegExp(flags, kind, null, null, s, c, min, max, digits, from, to);
   }
 
+  /**
+   * Return an <code>Automaton</code> from this <code>RegExp</code> that will skip the determinize
+   * and minimize step
+   *
+   * @return {@link Automaton} most likely non-deterministic
+   */
+  public Automaton toNFA() {

Review comment:
       I wonder just how "NFA" this Automaton really is.  Like for a simple regexp, what does the NFA even look like?  I know the `RegExp` code makes heavy use of `.addEpsilon` which creates many copies of transitions, etc.

##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -96,12 +109,35 @@ public AutomatonQuery(final Term term, Automaton automaton, int determinizeWorkL
    */
   public AutomatonQuery(
       final Term term, Automaton automaton, int determinizeWorkLimit, boolean isBinary) {
+    this(term, automaton, determinizeWorkLimit, isBinary, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}.
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param determinizeWorkLimit maximum effort to spend determinizing the automaton. If the
+   *     automaton will need more than this much effort, TooComplexToDeterminizeException is thrown.
+   *     Higher numbers require more space but can process more complex automata.
+   * @param isBinary if true, this automaton is already binary and will not go through the
+   *     UTF32ToUTF8 conversion
+   * @param runnableType NFA or DFA
+   */
+  public AutomatonQuery(

Review comment:
       Cool, so the existing ctor remains, defaulting to `DFA` execution strategy, where the automaton is first fully determinized.
   
   But now you add another ctor, letting users also ask for `NFA` execution, where the automaton is determinized lazily on-demand and only in those parts that the terms in this index need to visit.
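
The constructor layering described in this comment can be sketched in isolation. The snippet below is a simplified stand-in, not the PR's code: `ToyAutomatonQuery`, `ExecutionType`, and the `DEFAULT_WORK_LIMIT` value are hypothetical names and numbers chosen for illustration. It only shows how the existing constructors can keep their signatures and delegate with `DFA` as the default, while a new overload lets callers opt in to `NFA`.

```java
// Hypothetical, simplified stand-in for the constructor chain; not Lucene's AutomatonQuery.
final class ToyAutomatonQuery {
  enum ExecutionType {
    DFA, // determinize fully up-front
    NFA  // determinize lazily while intersecting terms
  }

  // Arbitrary value for the sketch; assumed, not taken from the PR.
  static final int DEFAULT_WORK_LIMIT = 10_000;

  final String field;
  final int determinizeWorkLimit;
  final ExecutionType type;

  // Existing-style constructor: unchanged signature, DFA by default.
  ToyAutomatonQuery(String field) {
    this(field, DEFAULT_WORK_LIMIT, ExecutionType.DFA);
  }

  // New overload: the caller picks the execution type explicitly.
  ToyAutomatonQuery(String field, ExecutionType type) {
    this(field, DEFAULT_WORK_LIMIT, type);
  }

  // Most specific constructor; every other overload funnels into it.
  ToyAutomatonQuery(String field, int determinizeWorkLimit, ExecutionType type) {
    this.field = field;
    this.determinizeWorkLimit = determinizeWorkLimit;
    this.type = type;
  }
}
```

Because every overload funnels into one constructor, existing callers keep the up-front determinization behaviour without any code change.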

##########
File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
##########
@@ -187,6 +187,14 @@ public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throw
     if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
       throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
     }
+    if (compiled.nfaRunAutomaton != null) {
+      return new IntersectTermsEnum(

Review comment:
       Ahh, so it was too difficult to support `nfaRunAutomaton` also in `BlockTree`?  This probably hurts performance quite a bit for `NFAQuery` -- `BlockTree`'s specialized `intersect` impl is fast.  But we can optimize later.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -133,7 +137,35 @@ private static int findSinkState(Automaton automaton) {
    * is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}.
    */
   public CompiledAutomaton(Automaton automaton, Boolean finite, boolean simplify) {
-    this(automaton, finite, simplify, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, false);
+    this(automaton, finite, simplify, ByteRunnable.TYPE.DFA);

Review comment:
       Good -- the existing ctors remain and default to `DFA` strategy.

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       Maybe instead of having separate `nfaRunAutomaton` and `runAutomaton` we could have only `runAutomaton` and a separate `CompiledAutomaton.isDeterminized` boolean?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -250,15 +291,23 @@ public CompiledAutomaton(
       }
     }
 
-    // This will determinize the binary automaton for us:
-    runAutomaton = new ByteRunAutomaton(binary, true, determinizeWorkLimit);
+    if (automaton.isDeterministic() == false && byteRunnableType == ByteRunnable.TYPE.NFA) {

Review comment:
       Are we still pulling the common prefix/suffix even in `NFA` mode?  @rmuir recently improved those operations to not require a determinized automaton.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {
+    /** use NFARunAutomaton */

Review comment:
       Can we improve these javadocs?  Instead of referring to internal classes, let's write it as seen from a somewhat less knowledgeable external future user.
   
   E.g. for `DFA`, something like `Fully determinize the automaton up-front for fast term intersection.  Some RegExps may fail to determinize, throwing TooComplexToDeterminizeException.  But if they do not, intersection is fast.`, and for `NFA`, something like `Determinize the automaton lazily on-demand as terms are intersected.  This option saves the up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will be a bit slower`?
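
As a rough illustration of the `DFA` side of this trade-off, the sketch below determinizes a toy NFA completely up-front with a subset construction and gives up once a cap is exceeded. Everything here is assumed for the example: `EagerDeterminizer`, `TooComplexException`, and `blowupNfa` are hypothetical names, and a simple cap on the number of DFA states stands in for Lucene's effort-based `determinizeWorkLimit`. Accept states are left out because only the size of the resulting DFA matters here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy up-front determinization (hypothetical; not Lucene's Operations.determinize).
final class EagerDeterminizer {
  // Hypothetical stand-in for TooComplexToDeterminizeException.
  static final class TooComplexException extends RuntimeException {
    TooComplexException(String message) {
      super(message);
    }
  }

  // nfa.get(s) maps a char to the NFA states reachable from s; state 0 is initial.
  // Returns every reachable state set together with its outgoing transitions.
  static Map<BitSet, Map<Character, BitSet>> determinize(
      List<Map<Character, BitSet>> nfa, int maxDfaStates) {
    Map<BitSet, Map<Character, BitSet>> dfa = new HashMap<>();
    Deque<BitSet> pending = new ArrayDeque<>();
    BitSet start = new BitSet();
    start.set(0);
    dfa.put(start, new HashMap<>());
    pending.add(start);
    while (!pending.isEmpty()) {
      BitSet current = pending.poll();
      Map<Character, BitSet> row = dfa.get(current);
      // Union the targets of every NFA state in the current set, per input char.
      for (int s = current.nextSetBit(0); s >= 0; s = current.nextSetBit(s + 1)) {
        for (Map.Entry<Character, BitSet> e : nfa.get(s).entrySet()) {
          row.computeIfAbsent(e.getKey(), k -> new BitSet()).or(e.getValue());
        }
      }
      for (BitSet target : row.values()) {
        if (!dfa.containsKey(target)) {
          if (dfa.size() >= maxDfaStates) {
            throw new TooComplexException("more than " + maxDfaStates + " DFA states needed");
          }
          dfa.put(target, new HashMap<>());
          pending.add(target);
        }
      }
    }
    return dfa;
  }

  // NFA for (a|b)*a(a|b){n}: n + 2 NFA states, but its DFA needs 2^(n+1) states.
  static List<Map<Character, BitSet>> blowupNfa(int n) {
    List<Map<Character, BitSet>> nfa = new ArrayList<>();
    for (int i = 0; i <= n + 1; i++) {
      nfa.add(new HashMap<>());
    }
    addEdge(nfa, 0, 'a', 0);
    addEdge(nfa, 0, 'b', 0);
    addEdge(nfa, 0, 'a', 1);
    for (int i = 1; i <= n; i++) {
      addEdge(nfa, i, 'a', i + 1);
      addEdge(nfa, i, 'b', i + 1);
    }
    return nfa;
  }

  private static void addEdge(List<Map<Character, BitSet>> nfa, int from, char c, int to) {
    nfa.get(from).computeIfAbsent(c, k -> new BitSet()).set(to);
  }
}
```

With `n = 3` the five-state NFA already expands to 16 DFA states, so a cap of 8 fails while a cap of 100 succeeds. That exponential growth is the kind of case where lazy determinization may help, since it only builds the state sets that the indexed terms actually visit.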

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+          ;
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(
+          newTextField(
+              FIELD, perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2), Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(directory);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+        ;
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();

Review comment:
       And then make a new (still reused) `HashSet` here, `perQueryVocab`?

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+          ;
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(
+          newTextField(
+              FIELD, perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2), Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(directory);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+        ;
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(40) + 30;
+      while (perLoopReuse.size() < termNum) {
+        if (random().nextBoolean()) {
+          perLoopReuse.add(vocabList.get(random().nextInt(vocabList.size())));
+        } else {
+          perLoopReuse.add(foreignVocabList.get(random().nextInt(foreignVocabList.size())));
+        }
+      }
+      Automaton a = null;
+      for (String term : perLoopReuse) {
+        if (a == null) {
+          a = Automata.makeString(term);
+        } else {
+          a = Operations.union(a, Automata.makeString(term));
+        }
+      }
+      if (a.isDeterministic()) {
+        i--;
+        continue;
+      }
+      AutomatonQuery dfaQuery = new AutomatonQuery(new Term(FIELD), a);
+      AutomatonQuery nfaQuery = new AutomatonQuery(new Term(FIELD), a, ByteRunnable.TYPE.NFA);

Review comment:
       Could you add a new `LuceneTestCase` method, `newAutomatonQuery`, and it would randomly pick between `NFA` and `DFA` type?  And then in a few pre-existing tests, let's call `newAutomatonQuery` instead of `new AutomatonQuery`?
   
   That method could also randomly make the automaton non-deterministic by simply cloning a few states?  We can do this in a follow-on issue.
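
A minimal sketch of the state-cloning idea, assuming a toy edge-list automaton rather than Lucene's `Automaton` class (`CloneStateSketch` and all of its methods are hypothetical names used only for illustration). Cloning a state that has at least one incoming edge leaves the accepted language unchanged but makes the automaton non-deterministic:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy automaton (not Lucene's Automaton): state 0 is the start state. */
final class CloneStateSketch {
  // Each edge is {source, label, destination}.
  final List<int[]> edges = new ArrayList<>();
  final List<Boolean> accept = new ArrayList<>();

  int addState(boolean isAccept) {
    accept.add(isAccept);
    return accept.size() - 1;
  }

  void addEdge(int src, int label, int dest) {
    edges.add(new int[] {src, label, dest});
  }

  /** Adds a copy of state s with the same acceptance, outgoing edges and incoming edges. */
  int cloneState(int s) {
    int clone = addState(accept.get(s));
    for (int[] e : new ArrayList<>(edges)) { // iterate over a snapshot while adding edges
      if (e[0] == s) addEdge(clone, e[1], e[2]); // copy outgoing edge
      if (e[2] == s) addEdge(e[0], e[1], clone); // duplicate incoming edge
    }
    return clone;
  }

  /** True if no state has two edges with the same label leading to different states. */
  boolean isDeterministic() {
    for (int[] a : edges) {
      for (int[] b : edges) {
        if (a[0] == b[0] && a[1] == b[1] && a[2] != b[2]) return false;
      }
    }
    return true;
  }

  /** Simulates the automaton on the input by tracking the set of reachable states. */
  boolean accepts(String input) {
    Set<Integer> current = new HashSet<>();
    current.add(0);
    for (int i = 0; i < input.length(); i++) {
      Set<Integer> next = new HashSet<>();
      for (int[] e : edges) {
        if (current.contains(e[0]) && e[1] == input.charAt(i)) next.add(e[2]);
      }
      current = next;
    }
    for (int s : current) {
      if (accept.get(s)) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    CloneStateSketch a = new CloneStateSketch();
    int s0 = a.addState(false);
    int s1 = a.addState(false);
    int s2 = a.addState(true);
    a.addEdge(s0, 'a', s1);
    a.addEdge(s1, 'b', s2);
    a.cloneState(s1);
    System.out.println(a.isDeterministic() + " " + a.accepts("ab"));
  }
}
```

Note that cloning a state with no incoming edges only adds an unreachable state, so a real helper would probably want to pick states that are reachable from the start state.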

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA

Review comment:
       Period after `DFA`.
   
   And maybe say `It will lazily determinize on-demand, memorizing the generated DFA states that indexed terms have intersected with`.
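
To illustrate what "lazily determinize on-demand" means here, a minimal sketch follows. It assumes a toy NFA given as an edge table, not Lucene's `Automaton`; `LazyDeterminizeSketch` and its members are hypothetical names. DFA states (sets of NFA states) and their transitions are only created and cached when an input actually reaches them:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy lazy subset construction: DFA states are built only when an input visits them. */
final class LazyDeterminizeSketch {
  static final int MISSING = -1;

  // edges[s] lists {label, dest} pairs leaving NFA state s; NFA state 0 is the start.
  private final int[][][] edges;
  private final boolean[] accept;

  // Memoized DFA states: each is a sorted list of NFA states, identified by its ordinal.
  private final Map<List<Integer>, Integer> ordOf = new HashMap<>();
  private final List<List<Integer>> dStates = new ArrayList<>();
  // Cached DFA transitions per DFA state, filled in the first time a label is seen.
  private final List<Map<Integer, Integer>> cache = new ArrayList<>();

  LazyDeterminizeSketch(int[][][] edges, boolean[] accept) {
    this.edges = edges;
    this.accept = accept;
    intern(Collections.singletonList(0)); // DFA state 0 = {NFA start state}
  }

  /** Returns the ordinal of the given NFA state set, assigning a new one if unseen. */
  private int intern(List<Integer> nfaStates) {
    Integer ord = ordOf.get(nfaStates);
    if (ord != null) return ord;
    ord = dStates.size();
    ordOf.put(nfaStates, ord);
    dStates.add(nfaStates);
    cache.add(new HashMap<>());
    return ord;
  }

  /** Next DFA state for (dState, label); computed once, then served from the cache. */
  int step(int dState, int label) {
    Integer cached = cache.get(dState).get(label);
    if (cached != null) return cached;
    TreeSet<Integer> next = new TreeSet<>();
    for (int s : dStates.get(dState)) {
      for (int[] e : edges[s]) {
        if (e[0] == label) next.add(e[1]);
      }
    }
    int result = next.isEmpty() ? MISSING : intern(new ArrayList<>(next));
    cache.get(dState).put(label, result);
    return result;
  }

  boolean run(String s) {
    int state = 0;
    for (int i = 0; i < s.length(); i++) {
      state = step(state, s.charAt(i));
      if (state == MISSING) return false;
    }
    for (int n : dStates.get(state)) {
      if (accept[n]) return true;
    }
    return false;
  }

  int numDStates() {
    return dStates.size();
  }

  public static void main(String[] args) {
    // NFA for (a|b)*ab: 0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -b-> 2 (accepting).
    int[][][] edges = {{{'a', 0}, {'b', 0}, {'a', 1}}, {{'b', 2}}, {}};
    LazyDeterminizeSketch m = new LazyDeterminizeSketch(edges, new boolean[] {false, false, true});
    System.out.println(m.run("abab") + ", DFA states built: " + m.numDStates());
  }
}
```

Only the DFA states that the tested strings pass through are ever materialized, which is the property the suggested javadoc wording is trying to convey.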

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();

Review comment:
       Maybe rename to `perDocVocab`?

##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -65,7 +66,19 @@
    * @param automaton Automaton to run, terms that are accepted are considered a match.
    */
   public AutomatonQuery(final Term term, Automaton automaton) {
-    this(term, automaton, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
+    this(term, automaton, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}. Using specific type of RunAutomaton
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param runnableType NFA or DFA

Review comment:
       Could you improve these javadocs a bit?  And specifically include a warning that `NFA` has uncertain performance impact?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] dweiss commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941302842







[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-942835409


   Thanks @dweiss, it seems this is not the first time we have seen this error: https://issues.apache.org/jira/browse/LUCENE-9839



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724613434



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {
+    /** use NFARunAutomaton */

Review comment:
       Sure!





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r766061920



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       Hmm, I think this was removed when I rebased onto mainline. Could you refresh the browser? I think the change you're referring to is outdated.





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r772594287



##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       Yes, changing it to use `RegExp.NONE` did not throw any exceptions in 5 consecutive runs for me. But I'm confused: by default `RegExp`'s constructor uses the `ALL` flag, which according to the javadoc means "enable all syntax", while `NONE` means "enable none additional syntax". Why would enabling all syntax throw an exception when `NONE` does not?





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r679509373



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];
+    }
+
+    private void setNextState(int charClass, int nextState) {
+      initTransitions();
+      assert charClass < transitions.length;
+      transitions[charClass] = nextState;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];

Review comment:
       Right, I think this approach trades memory for faster classification of incoming characters.
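
For reference, the classification step in question can be sketched in isolation as below. This is a standalone illustration, assuming `points` is a sorted array of interval start points beginning at 0; `CharClassSketch` is a hypothetical name and not part of the patch. Each DFA state then keeps one cached transition slot per class, i.e. an array of `points.length` ints, which is where the memory goes:

```java
/** Standalone illustration: map a code point to the index of the interval containing it. */
final class CharClassSketch {

  /**
   * @param points sorted interval start points; points[0] is assumed to be 0
   * @param c code point to classify
   * @return index i such that points[i] <= c and either i is last or c < points[i + 1]
   */
  static int charClass(int[] points, int c) {
    int lo = 0;
    int hi = points.length;
    while (hi - lo > 1) {
      int mid = (lo + hi) >>> 1;
      if (points[mid] > c) {
        hi = mid;
      } else if (points[mid] < c) {
        lo = mid;
      } else {
        return mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    // Four classes: [0, 'a'), ['a', 'd'), ['d', 'z'], and everything above 'z'.
    int[] points = {0, 'a', 'd', 'z' + 1};
    System.out.println(charClass(points, 'b'));
  }
}
```

With the class index in hand, looking up an already computed transition is a single array read, at the cost of allocating `points.length` slots for every DFA state that gets created.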





[GitHub] [lucene] mikemccand commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-888278654


   > I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable
   
   OK yeah +1 to keep it wholly separate (full fork) for now until we learn more about how this `NFARegexpQuery` behaves.



[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r678255582



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {

Review comment:
       see my comment: we should avoid oversharing for now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r725015736



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -551,12 +551,22 @@ static RegExp newLeafNode(
     return new RegExp(flags, kind, null, null, s, c, min, max, digits, from, to);
   }
 
+  /**
+   * Return an <code>Automaton</code> from this <code>RegExp</code> that will skip the determinize
+   * and minimize step
+   *
+   * @return {@link Automaton} most likely non-deterministic
+   */
+  public Automaton toNFA() {

Review comment:
       Yes, I think that's right!





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r726749819



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {

Review comment:
       Good idea, I renamed it to `RunAutomatonMode`. How does that sound?





[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r753339619



##########
File path: lucene/CHANGES.txt
##########
@@ -7,6 +7,8 @@ http://s.apache.org/luceneversions
 
 New Features
 
+* LUCENE-10010 Introduce NFARunAutomaton to run NFA directly. (Haoyu Zhai)

Review comment:
       Maybe change to `Patrick Zhai` for consistency?





[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

sonatype-lift[bot] commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r761596350



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA. It will lazily determinize on-demand, memoizing the
+ * generated DFA states that have been explored
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton implements ByteRunnable, TransitionAccessor {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+  final int[] classmap; // map from char number to class
+
+  private final Operations.PointTransitionSet transitionSet =
+      new Operations.PointTransitionSet(); // reusable
+  private final StateSet statesSet = new StateSet(5); // reusable
+
+  /**
+   * Constructor, assuming alphabet size is the whole Unicode code point space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+
+    /*
+     * Set alphabet table for optimal run performance.
+     */
+    classmap = new int[Math.min(256, alphabetSize)];
+    int i = 0;
+    for (int j = 0; j < classmap.length; j++) {
+      if (i + 1 < points.length && j == points[i + 1]) {
+        i++;
+      }
+      classmap[j] = i;
+    }
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  @Override
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  @Override
+  public boolean isAccept(int state) {
+    assert dStates[state] != null;
+    return dStates[state].isAccept;
+  }
+
+  @Override
+  public int getSize() {
+    return dStates.length;
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link DState#step(int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+
+    if (c < classmap.length) {
+      return classmap[c];
+    }
+
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  @Override
+  public int initTransition(int state, Transition t) {
+    t.source = state;
+    t.transitionUpto = -1;
+    return getNumTransitions(state);
+  }
+
+  @Override
+  public void getNextTransition(Transition t) {
+    assert t.transitionUpto < points.length - 1 && t.transitionUpto >= -1;
+    while (dStates[t.source].transitions[++t.transitionUpto] == MISSING) {
+      // this shouldn't throw AIOOBE as long as this function is only called
+      // numTransitions times
+    }
+    assert dStates[t.source].transitions[t.transitionUpto] != NOT_COMPUTED;
+    t.dest = dStates[t.source].transitions[t.transitionUpto];
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  @Override
+  public int getNumTransitions(int state) {
+    dStates[state].determinize();
+    return dStates[state].outgoingTransitions;
+  }
+
+  @Override
+  public void getTransition(int state, int index, Transition t) {
+    dStates[state].determinize();
+    int outgoingTransitions = -1;
+    t.transitionUpto = -1;
+    t.source = state;
+    while (outgoingTransitions < index && t.transitionUpto < points.length - 1) {
+      if (dStates[t.source].transitions[++t.transitionUpto] != MISSING) {
+        outgoingTransitions++;
+      }
+    }
+    assert outgoingTransitions == index;
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    // this field is lazily init'd when first time caller wants to add a new transition
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+    private final Transition stepTransition = new Transition();
+    private Transition minimalTransition;
+    private int computedTransitions;
+    private int outgoingTransitions;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      if (transitions[charClass] == NOT_COMPUTED) {
+        assignTransition(charClass, findDState(step(points[charClass])));
+        // we could potentially update more than one char classes
+        if (minimalTransition != null) {
+          // to the left
+          int cls = charClass;
+          while (cls > 0 && points[--cls] >= minimalTransition.min) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          // to the right
+          cls = charClass;
+          while (cls < points.length - 1 && points[++cls] <= minimalTransition.max) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          minimalTransition = null;
+        }
+      }
+      return transitions[charClass];
+    }
+
+    private void assignTransition(int charClass, int dest) {
+      if (transitions[charClass] == NOT_COMPUTED) {
+        computedTransitions++;
+        transitions[charClass] = dest;
+        if (transitions[charClass] != MISSING) {
+          outgoingTransitions++;
+        }
+      }
+    }
+
+    /**
+     * given a list of NFA states and a character c, compute the output list of NFA state which is
+     * wrapped as a DFA state
+     */
+    private DState step(int c) {
+      statesSet.reset(); // TODO: fork IntHashSet from hppc instead?
+      int numTransitions;
+      int left = -1, right = alphabetSize;
+      for (int nfaState : nfaStates) {
+        numTransitions = automaton.initTransition(nfaState, stepTransition);
+        // TODO: binary search should be faster, since transitions are sorted
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          if (stepTransition.min <= c && stepTransition.max >= c) {
+            statesSet.incr(stepTransition.dest);
+            left = Math.max(stepTransition.min, left);
+            right = Math.min(stepTransition.max, right);
+          }
+          if (stepTransition.max < c) {
+            left = Math.max(stepTransition.max + 1, left);
+          }
+          if (stepTransition.min > c) {
+            right = Math.min(stepTransition.min - 1, right);
+            // transitions in automaton are sorted
+            break;
+          }
+        }
+      }
+      if (statesSet.size() == 0) {
+        return null;
+      }
+      minimalTransition = new Transition();
+      minimalTransition.min = left;
+      minimalTransition.max = right;
+      return new DState(statesSet.getArray());
+    }
+
+    // determinize this state only
+    private void determinize() {
+      if (transitions != null && computedTransitions == transitions.length) {
+        // already determinized
+        return;
+      }
+      initTransitions();
+      // Mostly forked from Operations.determinize
+      transitionSet.reset();
+      for (int nfaState : nfaStates) {
+        int numTransitions = automaton.initTransition(nfaState, stepTransition);
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          transitionSet.add(stepTransition);
+        }
+      }
+      if (transitionSet.count == 0) {
+        // no outgoing transitions
+        Arrays.fill(transitions, MISSING);
+        computedTransitions = transitions.length;
+        return;
+      }
+
+      transitionSet
+          .sort(); // TODO: could use a PQ (heap) instead, since transitions for each state are
+      // sorted
+      statesSet.reset();
+      int lastPoint = -1;
+      int charClass = 0;
+      for (int i = 0; i < transitionSet.count; i++) {
+        final int point = transitionSet.points[i].point;
+        if (statesSet.size() > 0) {
+          assert lastPoint != -1;
+          int ord = findDState(new DState(statesSet.getArray()));
+          while (points[charClass] < lastPoint) {
+            assignTransition(charClass++, MISSING);
+          }
+          assert points[charClass] == lastPoint;
+          while (charClass < points.length && points[charClass] < point) {
+            assert transitions[charClass] == NOT_COMPUTED || transitions[charClass] == ord;
+            assignTransition(charClass++, ord);
+          }
+          assert (charClass == points.length && point == alphabetSize)
+              || points[charClass] == point;
+        }
+
+        // process transitions that end on this point
+        // (closes an overlapping interval)
+        int[] transitions = transitionSet.points[i].ends.transitions;
+        int limit = transitionSet.points[i].ends.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.decr(dest);
+        }
+        transitionSet.points[i].ends.next = 0;
+
+        // process transitions that start on this point
+        // (opens a new interval)
+        transitions = transitionSet.points[i].starts.transitions;
+        limit = transitionSet.points[i].starts.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.incr(dest);
+        }
+
+        lastPoint = point;
+        transitionSet.points[i].starts.next = 0;
+      }
+      assert statesSet.size() == 0;
+      assert computedTransitions
+          >= charClass; // it's also possible that some transitions after the charClass has already
+      // been explored
+      // no more outgoing transitions, set rest of transition to MISSING
+      assert charClass == transitions.length
+          || transitions[charClass] == MISSING
+          || transitions[charClass] == NOT_COMPUTED;
+      Arrays.fill(transitions, charClass, transitions.length, MISSING);
+      computedTransitions = transitions.length;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];
+        Arrays.fill(transitions, NOT_COMPUTED);
+      }
+    }
+
+    @Override
+    public int hashCode() {
+      return hashCode;
+    }
+
+    @Override
+    public boolean equals(Object o) {

Review comment:
       *EqualsGetClass:*  Overriding Object#equals in a non-final class by using getClass rather than instanceof breaks substitutability of subclasses. [(details)](https://errorprone.info/bugpattern/EqualsGetClass)
   (at-me [in a reply](https://help.sonatype.com/lift/talking-to-lift) with `help` or `ignore`)
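
The EqualsGetClass warning above is commonly addressed by making the class `final`, by comparing with `instanceof` instead of `getClass()`, or both. The following is a minimal sketch of that pattern, not the PR's code: `StateKey` is a hypothetical stand-in for `DState` (whose `equals` body is not shown in the quoted diff), and sorting the NFA state array is an assumption made here so that array equality is order-independent.

```java
import java.util.Arrays;

// Hypothetical stand-in for a DFA state keyed by its set of NFA states.
final class StateKey {
  private final int[] nfaStates; // sorted copy, so equality ignores input order
  private final int hashCode;    // precomputed, since the key is immutable

  StateKey(int[] nfaStates) {
    this.nfaStates = nfaStates.clone();
    Arrays.sort(this.nfaStates);
    this.hashCode = Arrays.hashCode(this.nfaStates);
  }

  @Override
  public int hashCode() {
    return hashCode;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) return true;
    // instanceof also rejects null; the class is final, so no subclass can break symmetry
    if (!(o instanceof StateKey)) return false;
    StateKey other = (StateKey) o;
    return hashCode == other.hashCode && Arrays.equals(nfaStates, other.nfaStates);
  }
}
```

Because the class is `final`, `instanceof` and `getClass()` accept exactly the same objects here, so the change only affects how the check reads to static analysis.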





[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941199775


   Hmm, one of the tests failed:
   ```
   ERROR: The following test(s) have failed:
   > Task :lucene:analysis:smartcn:test
     - org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef (:lucene:core)
   :lucene:analysis:smartcn:test (SUCCESS): 21 test(s)
       Test output: /home/runner/work/lucene/lucene/lucene/core/build/test-results/test/outputs/OUTPUT-org.apache.lucene.index.TestIndexFileDeleter.txt
   
       Reproduce with: gradlew :lucene:core:test --tests "org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef" -Ptests.jvms=1 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=FD14DA9475FFAE2C -Ptests.slow=false -Ptests.file.encoding=UTF-8
   ```
   But I can't reproduce it locally...



[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r694036897



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -261,6 +266,21 @@ public CompiledAutomaton(
     sinkState = findSinkState(this.automaton);
   }
 
+  public CompiledAutomaton(Automaton automaton, boolean isNFA) {

Review comment:
       Maybe we could introduce an effectively boolean valued `enum` so that caller would have to use e.g. `AutomatonType.NFA|DFA` to make it clearer?
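
The enum suggestion above could look roughly like the sketch below. All names are hypothetical (`AutomatonKind` and `CompiledSketch` are illustrative, not Lucene classes); the point is only that the call site names the mode instead of passing a bare `true` or `false`.

```java
import java.util.Objects;

// Hypothetical two-valued enum replacing a boolean isNFA parameter.
enum AutomatonKind {
  NFA,
  DFA
}

// Hypothetical stand-in for a class whose constructor used to take a boolean.
final class CompiledSketch {
  private final AutomatonKind kind;

  CompiledSketch(AutomatonKind kind) {
    this.kind = Objects.requireNonNull(kind, "kind");
  }

  // Only the DFA mode would determinize eagerly in this sketch.
  boolean determinizesUpFront() {
    return kind == AutomatonKind.DFA;
  }
}
```

A call such as `new CompiledSketch(AutomatonKind.NFA)` is self-describing, whereas the meaning of `true` in `new CompiledAutomaton(automaton, true)` has to be looked up.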

##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -100,8 +100,12 @@ public AutomatonQuery(
     this.term = term;
     this.automaton = automaton;
     this.automatonIsBinary = isBinary;
-    // TODO: we could take isFinite too, to save a bit of CPU in CompiledAutomaton ctor?:
-    this.compiled = new CompiledAutomaton(automaton, null, true, determinizeWorkLimit, isBinary);
+    if (determinizeWorkLimit == 0) {

Review comment:
       Hmm, we need to update the javadoc above to say that passing `determinizeWorkLimit==0` means to use the NFA instead?  Or rather, that it means "determinize on demand, state by state, as terms in the index require"?
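
To make "determinize on demand, state by state" concrete, here is a minimal sketch of lazy subset construction with memoization. It is a toy model and not the PR's implementation: `LazyDfaSketch`, its edge-triple representation, and `exploredStates()` are assumptions for illustration, and it matches single labels rather than the character ranges and character classes that `NFARunAutomaton` uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model, not Lucene's API: an NFA over int labels with start state 0.
final class LazyDfaSketch {
  static final int MISSING = -1;

  private final int[][] edges;    // each edge is {source, label, dest}
  private final boolean[] accept; // accept[s] is true if NFA state s accepts
  private final Map<TreeSet<Integer>, Integer> ordOf = new HashMap<>();
  private final List<TreeSet<Integer>> sets = new ArrayList<>();
  private final List<Map<Integer, Integer>> cache = new ArrayList<>();

  LazyDfaSketch(int[][] edges, boolean[] accept) {
    this.edges = edges;
    this.accept = accept;
    TreeSet<Integer> start = new TreeSet<>();
    start.add(0);
    intern(start); // DFA state 0 is the set {0}
  }

  // Returns the ordinal of a set of NFA states, assigning a new one on first sight.
  private int intern(TreeSet<Integer> set) {
    if (set.isEmpty()) return MISSING;
    Integer ord = ordOf.get(set);
    if (ord != null) return ord;
    int fresh = sets.size();
    ordOf.put(set, fresh);
    sets.add(set);
    cache.add(new HashMap<>());
    return fresh;
  }

  // Computes the transition only the first time it is requested, then reuses it.
  int step(int dState, int label) {
    Integer cached = cache.get(dState).get(label);
    if (cached != null) return cached;
    TreeSet<Integer> next = new TreeSet<>();
    for (int[] e : edges) {
      if (e[1] == label && sets.get(dState).contains(e[0])) next.add(e[2]);
    }
    int ord = intern(next);
    cache.get(dState).put(label, ord);
    return ord;
  }

  boolean run(String s) {
    int state = 0;
    for (int i = 0; i < s.length(); i++) {
      state = step(state, s.charAt(i));
      if (state == MISSING) return false;
    }
    for (int nfaState : sets.get(state)) {
      if (accept[nfaState]) return true;
    }
    return false;
  }

  // Number of DFA states materialized so far.
  int exploredStates() {
    return sets.size();
  }
}
```

In this sketch only the DFA states that the inputs actually reach are ever created, which is the property the suggested javadoc wording describes.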

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/Stepable.java
##########
@@ -0,0 +1,13 @@
+package org.apache.lucene.util.automaton;

Review comment:
       Add copyright header?

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.lucene.codecs.lucene90.Lucene90Codec;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+import org.apache.lucene.util.ToStringUtils;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomAutomaton() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton dfa = regExp.toDFA();
+      NFARunAutomaton candidate = new NFARunAutomaton(regExp.toNFA());

Review comment:
       Another potentially powerful way to test NFA behavior would be to create a random DFA (e.g. make a random set of strings and pass it to the Daciuk/Mihov builder, or use a random `RegExp.toDFA()` as here, or maybe even randomly construct something state by state and transition by transition), then add a new test-only method that converts any DFA back into an NFA by randomly picking a DFA state and duplicating it (preserving all incoming and leaving transitions), or maybe by generating N strings accepted by the DFA and unioning them back into it.  Do that N times so that N states get duplicated.  This should not alter the language accepted by the automaton, but it should make it very non-deterministic (very "N").
   
   Finally, randomly enumerate strings that the DFA accepts and assert that the `NFARunAutomaton` also accepts them.  Also, sometimes generate random strings that the DFA does not accept, and confirm that the NFA does not accept them either.
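
The state-duplication idea in this comment can be sketched on a toy automaton as follows. This is an illustration under stated assumptions, not a proposed Lucene test: `ToyAutomaton`, `duplicateState`, and `makeNondeterministic` are hypothetical names, edges carry single labels instead of ranges, and acceptance is checked with a plain set-based NFA simulation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Toy automaton for illustration only; none of these names are Lucene APIs.
final class ToyAutomaton {
  final List<int[]> edges = new ArrayList<>();    // each edge is {source, label, dest}
  final List<Boolean> accept = new ArrayList<>(); // index = state; state 0 is the start

  int addState(boolean isAccept) {
    accept.add(isAccept);
    return accept.size() - 1;
  }

  void addEdge(int source, int label, int dest) {
    edges.add(new int[] {source, label, dest});
  }

  // Returns a copy in which state q has a twin with the same incoming and leaving edges.
  ToyAutomaton duplicateState(int q) {
    ToyAutomaton copy = new ToyAutomaton();
    copy.accept.addAll(accept);
    int twin = copy.addState(accept.get(q));
    for (int[] e : edges) {
      copy.addEdge(e[0], e[1], e[2]);
      if (e[0] == q) copy.addEdge(twin, e[1], e[2]); // leaving edge
      if (e[2] == q) copy.addEdge(e[0], e[1], twin); // incoming edge
    }
    return copy;
  }

  // Plain set-based NFA simulation.
  boolean accepts(String s) {
    Set<Integer> current = new HashSet<>();
    current.add(0);
    for (int i = 0; i < s.length(); i++) {
      Set<Integer> next = new HashSet<>();
      for (int[] e : edges) {
        if (e[1] == s.charAt(i) && current.contains(e[0])) next.add(e[2]);
      }
      current = next;
    }
    for (int state : current) {
      if (accept.get(state)) return true;
    }
    return false;
  }

  // Duplicates n randomly chosen states; the accepted language should not change.
  static ToyAutomaton makeNondeterministic(ToyAutomaton dfa, int n, Random random) {
    ToyAutomaton result = dfa;
    for (int i = 0; i < n; i++) {
      result = result.duplicateState(random.nextInt(result.accept.size()));
    }
    return result;
  }
}
```

Mapping each twin back to its original state sends every edge of the new automaton to an edge of the old one, and the old automaton is contained in the new one, so both accept the same strings. A test can therefore compare `accepts` on the original DFA and on the duplicated automaton for both accepted and rejected inputs.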

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];
+    }
+
+    private void setNextState(int charClass, int nextState) {
+      initTransitions();
+      assert charClass < transitions.length;
+      transitions[charClass] = nextState;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];

Review comment:
       Yeah, I think it's fair.  You are basically building up, lazily and on demand, node by node, the same table that `RunAutomaton` creates entirely up front for a DFA.
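
To make that comparison concrete, here is a minimal, self-contained sketch of lazy subset construction. It is not the PR's code: `LazyDfaSketch`, its two-symbol alphabet, and the array-based NFA encoding are assumptions made for illustration; only the `MISSING` / `NOT_COMPUTED` sentinels mirror the diff above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical toy (not Lucene code): determinizes an NFA over the alphabet {0, 1} on demand. */
final class LazyDfaSketch {
  static final int MISSING = -1; // no such transition
  static final int NOT_COMPUTED = -2; // table cell not filled in yet

  private final int[][][] nfa; // nfa[state][symbol] = destination NFA states
  private final boolean[] nfaAccept;
  private final Map<List<Integer>, Integer> ordOf = new HashMap<>();
  private final List<int[]> nfaStatesOf = new ArrayList<>(); // DFA ordinal -> NFA state set
  private final List<int[]> table = new ArrayList<>(); // DFA ordinal -> row of next ordinals

  LazyDfaSketch(int[][][] nfa, boolean[] nfaAccept) {
    this.nfa = nfa;
    this.nfaAccept = nfaAccept;
    intern(new int[] {0}); // DFA state 0 wraps NFA start state 0
  }

  /** Returns the ordinal of the DFA state for this sorted NFA state set, creating it if new. */
  private int intern(int[] states) {
    if (states.length == 0) {
      return MISSING;
    }
    List<Integer> key = new ArrayList<>();
    for (int s : states) {
      key.add(s);
    }
    Integer ord = ordOf.get(key);
    if (ord != null) {
      return ord;
    }
    int newOrd = nfaStatesOf.size();
    ordOf.put(key, newOrd);
    nfaStatesOf.add(states);
    int[] row = new int[2];
    Arrays.fill(row, NOT_COMPUTED);
    table.add(row);
    return newOrd;
  }

  /** Fills in one table cell on first use; later calls read the cached value. */
  int step(int dState, int symbol) {
    int[] row = table.get(dState);
    if (row[symbol] == NOT_COMPUTED) {
      TreeSet<Integer> dest = new TreeSet<>(); // sorted, so equal sets produce equal keys
      for (int s : nfaStatesOf.get(dState)) {
        for (int d : nfa[s][symbol]) {
          dest.add(d);
        }
      }
      row[symbol] = intern(dest.stream().mapToInt(Integer::intValue).toArray());
    }
    return row[symbol];
  }

  boolean isAccept(int dState) {
    for (int s : nfaStatesOf.get(dState)) {
      if (nfaAccept[s]) {
        return true;
      }
    }
    return false;
  }

  boolean accepts(int[] input) {
    int p = 0;
    for (int c : input) {
      p = step(p, c);
      if (p == MISSING) {
        return false;
      }
    }
    return isAccept(p);
  }

  /** Number of DFA states discovered so far. */
  int numDStates() {
    return nfaStatesOf.size();
  }
}
```

An eager determinization would fill every row of `table` before the first `step` call; here a row is allocated only when its DFA state is first reached, and a cell only when that (state, symbol) pair is first stepped.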

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -261,6 +266,21 @@ public CompiledAutomaton(
     sinkState = findSinkState(this.automaton);
   }
 
+  public CompiledAutomaton(Automaton automaton, boolean isNFA) {
+    // nocommit: the parameter "isNFA" makes no sense, is only used to distinguish the ctor
+    assert automaton.isDeterministic() == false;

Review comment:
       Maybe make this a real `if`?
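
A minimal sketch of what such a check could look like, relevant because Java `assert` statements are skipped unless the JVM runs with `-ea`. `DeterminismProbe` and `NfaGuardSketch` are hypothetical stand-ins so the example compiles on its own; the exception type and message are assumptions, not something the PR specifies.

```java
/** Hypothetical stand-in for the single Automaton method this sketch needs. */
interface DeterminismProbe {
  boolean isDeterministic();
}

/** Hypothetical helper (not Lucene code) showing an explicit argument check. */
final class NfaGuardSketch {
  private NfaGuardSketch() {}

  /** Rejects deterministic input even when assertions are disabled. */
  static void requireNfa(DeterminismProbe automaton) {
    if (automaton.isDeterministic()) {
      throw new IllegalArgumentException(
          "expected a non-deterministic automaton; use the DFA code path for a DFA");
    }
  }
}
```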

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+
+import org.apache.lucene.codecs.lucene90.Lucene90Codec;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.IndexWriter;
+import org.apache.lucene.index.IndexWriterConfig;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+import org.apache.lucene.util.ToStringUtils;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomAutomaton() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton dfa = regExp.toDFA();
+      NFARunAutomaton candidate = new NFARunAutomaton(regExp.toNFA());
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen =
+          new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setCodec(new Lucene90Codec());
+    IndexWriter writer = new IndexWriter(directory, iwc);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0);
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(newTextField(FIELD,
+                   perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2),
+                   Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(writer);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0);
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(40) + 30;
+      while (perLoopReuse.size() < termNum) {
+        if (random().nextBoolean()) {
+          perLoopReuse.add(vocabList.get(random().nextInt(vocabList.size())));
+        } else {
+          perLoopReuse.add(foreignVocabList.get(random().nextInt(foreignVocabList.size())));
+        }
+      }
+      Automaton a = null;
+      for (String term: perLoopReuse) {
+        if (a == null) {
+          a = Automata.makeString(term);
+        } else {
+          a = Operations.union(a, Automata.makeString(term));
+        }
+      }
+      if (a.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Query dfaQuery = new AutomatonQuery(new Term(FIELD), a);
+      Query nfaQuery = new AutomatonQuery(new Term(FIELD), a, 0);

Review comment:
       It's very cool that you are able to completely reuse `AutomatonQuery` to run either an NFA or a DFA by using `determinizeWorkLimit = 0` to mean "create an NFA", since you made the two interfaces, versus forking and then requiring the user to make an `NFAAutomatonQuery`.
   
   But I think it's a little dangerous to take such a low-level approach?  (Your `// nocommit` above).  That approach means anyone who calls `determinize` with a `0` work limit will quietly get an NFA back, not just `AutomatonQuery`, which knows how to handle the NFA properly.  I.e. don't other places that call `determinize` expect to always get back a deterministic automaton?  Maybe, instead, we could just change `AutomatonQuery` to take an optional `determinizeUpFront` boolean, which would default to `true` (preserving current behavior)?
   
   Or, maybe keep `determinizeWorkLimit==0` to mean "use NFA", but move that `if` up from way down in `determinize` to higher up in `AutomatonQuery`?
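
A rough sketch of the first suggestion, under stated assumptions: `AutomatonQuerySketch` is a hypothetical generic stand-in rather than the real `AutomatonQuery`, and the `determinize` function is passed in so the example is self-contained. The point it illustrates is that the low-level function keeps its contract of always returning a deterministic result, and only the query decides whether to call it.

```java
import java.util.function.UnaryOperator;

/** Hypothetical stand-in (not the real AutomatonQuery) with a query-level NFA/DFA switch. */
final class AutomatonQuerySketch<A> {
  final A automaton;
  final boolean determinized;

  /** Default preserves the current behavior: determinize up front. */
  AutomatonQuerySketch(A automaton, UnaryOperator<A> determinize) {
    this(automaton, determinize, true);
  }

  /** With determinizeUpFront == false the automaton is kept as given, to be run as an NFA. */
  AutomatonQuerySketch(A automaton, UnaryOperator<A> determinize, boolean determinizeUpFront) {
    this.automaton = determinizeUpFront ? determinize.apply(automaton) : automaton;
    this.determinized = determinizeUpFront;
  }
}
```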

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/TransitionAccessor.java
##########
@@ -0,0 +1,15 @@
+package org.apache.lucene.util.automaton;

Review comment:
       Copyright header here too?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {

Review comment:
       OK, I agree.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {

Review comment:
       OK, I think we can clean up the code duplication later -- it's not urgent, certainly not for a first prototype PR.  Let's just focus on making NFA work well here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r772592359



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       Hmm, I checked it again and it's a bit complex to merge `runAutomaton` and `nfaRunAutomaton` into one, because `runAutomaton` previously had public access and is fairly widely used; its type appears directly in some public APIs such as `QueryVisitor#consumesTermsMatching`. So it might take another PR to try to merge them; I left a TODO for now.
   
   An alternative I'm thinking about is to move those `if`s inside `CompiledAutomaton` and only expose methods like `getByteRunnable`, so that people don't manipulate the fields from outside. I'll include that in the next commit.
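
A minimal sketch of that alternative, with assumptions labeled: `ByteRunnableSketch` and `CompiledSketch` are hypothetical stand-ins for `ByteRunnable` and `CompiledAutomaton`, and the factory method names are invented for illustration. Only the idea of a single `getByteRunnable` accessor comes from the comment above.

```java
/** Hypothetical stand-in for the shared interface that both runners implement. */
interface ByteRunnableSketch {
  int step(int state, int c);

  boolean isAccept(int state);
}

/** Hypothetical stand-in for CompiledAutomaton: the NFA/DFA choice stays private. */
final class CompiledSketch {
  private final ByteRunnableSketch dfaRunner; // null when compiled as an NFA
  private final ByteRunnableSketch nfaRunner; // null when compiled as a DFA

  private CompiledSketch(ByteRunnableSketch dfaRunner, ByteRunnableSketch nfaRunner) {
    this.dfaRunner = dfaRunner;
    this.nfaRunner = nfaRunner;
  }

  static CompiledSketch forDfa(ByteRunnableSketch runner) {
    return new CompiledSketch(runner, null);
  }

  static CompiledSketch forNfa(ByteRunnableSketch runner) {
    return new CompiledSketch(null, runner);
  }

  /** The only place that branches on NFA vs. DFA; callers never see the two fields. */
  ByteRunnableSketch getByteRunnable() {
    return nfaRunner != null ? nfaRunner : dfaRunner;
  }
}
```

A caller such as the `DirectIntersectTermsEnum` constructor in the diff above would then read one accessor instead of testing `nfaRunAutomaton != null` itself.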





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r762178718



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA. It will lazily determinize on-demand, memorizing the
+ * generated DFA states that have been explored
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton implements ByteRunnable, TransitionAccessor {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+  final int[] classmap; // map from char number to class
+
+  private final Operations.PointTransitionSet transitionSet =
+      new Operations.PointTransitionSet(); // reusable
+  private final StateSet statesSet = new StateSet(5); // reusable
+
+  /**
+   * Constructor, assuming alphabet size is the whole Unicode code point space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton}
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+
+    /*
+     * Set alphabet table for optimal run performance.
+     */
+    classmap = new int[Math.min(256, alphabetSize)];
+    int i = 0;
+    for (int j = 0; j < classmap.length; j++) {
+      if (i + 1 < points.length && j == points[i + 1]) {
+        i++;
+      }
+      classmap[j] = i;
+    }
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  @Override
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  @Override
+  public boolean isAccept(int state) {
+    assert dStates[state] != null;
+    return dStates[state].isAccept;
+  }
+
+  @Override
+  public int getSize() {
+    return dStates.length;
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link DState#step(int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+
+    if (c < classmap.length) {
+      return classmap[c];
+    }
+
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  @Override
+  public int initTransition(int state, Transition t) {
+    t.source = state;
+    t.transitionUpto = -1;
+    return getNumTransitions(state);
+  }
+
+  @Override
+  public void getNextTransition(Transition t) {
+    assert t.transitionUpto < points.length - 1 && t.transitionUpto >= -1;
+    while (dStates[t.source].transitions[++t.transitionUpto] == MISSING) {
+      // this shouldn't throw AIOOBE as long as this function is only called
+      // numTransitions times
+    }
+    assert dStates[t.source].transitions[t.transitionUpto] != NOT_COMPUTED;
+    t.dest = dStates[t.source].transitions[t.transitionUpto];
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  @Override
+  public int getNumTransitions(int state) {
+    dStates[state].determinize();
+    return dStates[state].outgoingTransitions;
+  }
+
+  @Override
+  public void getTransition(int state, int index, Transition t) {
+    dStates[state].determinize();
+    int outgoingTransitions = -1;
+    t.transitionUpto = -1;
+    t.source = state;
+    while (outgoingTransitions < index && t.transitionUpto < points.length - 1) {
+      if (dStates[t.source].transitions[++t.transitionUpto] != MISSING) {
+        outgoingTransitions++;
+      }
+    }
+    assert outgoingTransitions == index;
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    // this field is lazily init'd when first time caller wants to add a new transition
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+    private final Transition stepTransition = new Transition();
+    private Transition minimalTransition;
+    private int computedTransitions;
+    private int outgoingTransitions;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      if (transitions[charClass] == NOT_COMPUTED) {
+        assignTransition(charClass, findDState(step(points[charClass])));
+        // we could potentially update more than one char classes
+        if (minimalTransition != null) {
+          // to the left
+          int cls = charClass;
+          while (cls > 0 && points[--cls] >= minimalTransition.min) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          // to the right
+          cls = charClass;
+          while (cls < points.length - 1 && points[++cls] <= minimalTransition.max) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          minimalTransition = null;
+        }
+      }
+      return transitions[charClass];
+    }
+
+    private void assignTransition(int charClass, int dest) {
+      if (transitions[charClass] == NOT_COMPUTED) {
+        computedTransitions++;
+        transitions[charClass] = dest;
+        if (transitions[charClass] != MISSING) {
+          outgoingTransitions++;
+        }
+      }
+    }
+
+    /**
+     * given a list of NFA states and a character c, compute the output list of NFA state which is
+     * wrapped as a DFA state
+     */
+    private DState step(int c) {
+      statesSet.reset(); // TODO: fork IntHashSet from hppc instead?
+      int numTransitions;
+      int left = -1, right = alphabetSize;
+      for (int nfaState : nfaStates) {
+        numTransitions = automaton.initTransition(nfaState, stepTransition);
+        // TODO: binary search should be faster, since transitions are sorted
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          if (stepTransition.min <= c && stepTransition.max >= c) {
+            statesSet.incr(stepTransition.dest);
+            left = Math.max(stepTransition.min, left);
+            right = Math.min(stepTransition.max, right);
+          }
+          if (stepTransition.max < c) {
+            left = Math.max(stepTransition.max + 1, left);
+          }
+          if (stepTransition.min > c) {
+            right = Math.min(stepTransition.min - 1, right);
+            // transitions in automaton are sorted
+            break;
+          }
+        }
+      }
+      if (statesSet.size() == 0) {
+        return null;
+      }
+      minimalTransition = new Transition();
+      minimalTransition.min = left;
+      minimalTransition.max = right;
+      return new DState(statesSet.getArray());
+    }
+
+    // determinize this state only
+    private void determinize() {
+      if (transitions != null && computedTransitions == transitions.length) {
+        // already determinized
+        return;
+      }
+      initTransitions();
+      // Mostly forked from Operations.determinize
+      transitionSet.reset();
+      for (int nfaState : nfaStates) {
+        int numTransitions = automaton.initTransition(nfaState, stepTransition);
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          transitionSet.add(stepTransition);
+        }
+      }
+      if (transitionSet.count == 0) {
+        // no outgoing transitions
+        Arrays.fill(transitions, MISSING);
+        computedTransitions = transitions.length;
+        return;
+      }
+
+      transitionSet
+          .sort(); // TODO: could use a PQ (heap) instead, since transitions for each state are
+      // sorted
+      statesSet.reset();
+      int lastPoint = -1;
+      int charClass = 0;
+      for (int i = 0; i < transitionSet.count; i++) {
+        final int point = transitionSet.points[i].point;
+        if (statesSet.size() > 0) {
+          assert lastPoint != -1;
+          int ord = findDState(new DState(statesSet.getArray()));
+          while (points[charClass] < lastPoint) {
+            assignTransition(charClass++, MISSING);
+          }
+          assert points[charClass] == lastPoint;
+          while (charClass < points.length && points[charClass] < point) {
+            assert transitions[charClass] == NOT_COMPUTED || transitions[charClass] == ord;
+            assignTransition(charClass++, ord);
+          }
+          assert (charClass == points.length && point == alphabetSize)
+              || points[charClass] == point;
+        }
+
+        // process transitions that end on this point
+        // (closes an overlapping interval)
+        int[] transitions = transitionSet.points[i].ends.transitions;
+        int limit = transitionSet.points[i].ends.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.decr(dest);
+        }
+        transitionSet.points[i].ends.next = 0;
+
+        // process transitions that start on this point
+        // (opens a new interval)
+        transitions = transitionSet.points[i].starts.transitions;
+        limit = transitionSet.points[i].starts.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.incr(dest);
+        }
+
+        lastPoint = point;
+        transitionSet.points[i].starts.next = 0;
+      }
+      assert statesSet.size() == 0;
+      assert computedTransitions
+          >= charClass; // it's also possible that some transitions after the charClass has already
+      // been explored
+      // no more outgoing transitions, set rest of transition to MISSING
+      assert charClass == transitions.length
+          || transitions[charClass] == MISSING
+          || transitions[charClass] == NOT_COMPUTED;
+      Arrays.fill(transitions, charClass, transitions.length, MISSING);
+      computedTransitions = transitions.length;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];
+        Arrays.fill(transitions, NOT_COMPUTED);
+      }
+    }
+
+    @Override
+    public int hashCode() {
+      return hashCode;
+    }
+
+    @Override
+    public boolean equals(Object o) {

Review comment:
       @sonatype-lift ignore




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r766065104



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -614,16 +624,17 @@ public Automaton toAutomaton(AutomatonProvider automaton_provider, int determini
    */
   public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null, determinizeWorkLimit, true);
   }
 
   private Automaton toAutomaton(
       Map<String, Automaton> automata,
       AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      int determinizeWorkLimit,
+      boolean buildDFA)

Review comment:
       Hmm, I only had 2 comments for the code review, but this conversation was "unresolved", so it seems to have re-posted my old, outdated comments from before! I'll see if it did this elsewhere. Sorry!





[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677506600



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space

Review comment:
       Maybe say `whole Unicode code point space`?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];
+    }
+
+    private void setNextState(int charClass, int nextState) {
+      initTransitions();
+      assert charClass < transitions.length;
+      transitions[charClass] = nextState;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];

Review comment:
       Hmm, `points` is global across the whole `Automaton`, gathering all unique char classes that were ever seen in the automaton?  So isn't it overkill to always allocate this many slots for the outgoing transitions of a single state?  Though, I guess this matches how `RunAutomaton` works, with its pre-compiled transition lookup table?
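
To make the sizing concern concrete, here is a minimal, self-contained sketch (not the PR's code) of how a sorted `points` array partitions the alphabet into character classes, and why a per-state table gets `points.length` slots no matter how many classes that state actually uses. The class name `CharClassSketch`, its helper names, and the sample `points` values are made up for illustration. The sketch also assumes `points[0] == 0` and `c >= 0`. It uses `Arrays.binarySearch` in place of the hand-written binary search in `getCharClass`.

```java
import java.util.Arrays;

final class CharClassSketch {
  /** Returns the index of the last start point <= c, i.e. the character class of c. */
  static int charClass(int[] points, int c) {
    int idx = Arrays.binarySearch(points, c);
    // On a miss, binarySearch returns -(insertionPoint) - 1; the class is the slot before it.
    return idx >= 0 ? idx : -idx - 2;
  }

  /** Allocates one slot per global char class, whether or not a given state uses it. */
  static int[] newTransitionTable(int[] points, int notComputed) {
    int[] table = new int[points.length];
    Arrays.fill(table, notComputed);
    return table;
  }

  public static void main(String[] args) {
    // Hypothetical start points: [0,'a'), ['a','d'), ['d','z'], and everything after 'z'.
    int[] points = {0, 'a', 'd', 'z' + 1};
    System.out.println("class of 'b': " + charClass(points, 'b'));
    System.out.println("slots per state: " + newTransitionTable(points, -2).length);
  }
}
```

With many distinct ranges in the automaton, every explored state pays for the full table even if only a couple of its slots are ever filled, which is the trade-off the comment is asking about.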

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       Hmm, did `javac` or one of our `ecj` linters demand that you do this :)  (instead of leaving a comment explaining why we are ignoring the exception)?
   
   And maybe add a comment anyways explaining how `randomRegexp` will sometimes/often result in this exception?  In fact, maybe we should fix `AutomatonTestUtil` to do this "filtering" for us, so that it returns a valid `RegExp`?
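
A rough sketch of the "filtering" idea, for illustration only: a helper keeps drawing candidates until one parses, so the test body never sees the parse failure. This does not use Lucene's `RegExp` or `AutomatonTestUtil`. As a stand-in it parses with `java.util.regex.Pattern`, whose `PatternSyntaxException` is an `IllegalArgumentException`. The names `RandomValidRegex` and `nextValid`, and the tiny candidate alphabet, are hypothetical.

```java
import java.util.Random;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.regex.Pattern;

final class RandomValidRegex {
  /** Draws candidates until one parses; invalid candidates are silently skipped. */
  static <T> T nextValid(Supplier<String> candidates, Function<String, T> parser) {
    while (true) {
      String candidate = candidates.get();
      try {
        return parser.apply(candidate);
      } catch (IllegalArgumentException e) {
        // Expected: a random string is often not a valid regex, so try another one.
      }
    }
  }

  public static void main(String[] args) {
    Random random = new Random(42);
    String alphabet = "ab(|)*["; // hypothetical alphabet that yields both valid and invalid regexes
    Supplier<String> candidates =
        () -> {
          StringBuilder sb = new StringBuilder();
          int len = 1 + random.nextInt(5);
          for (int i = 0; i < len; i++) {
            sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
          }
          return sb.toString();
        };
    Pattern p = nextValid(candidates, Pattern::compile);
    System.out.println("valid regex: " + p.pattern());
  }
}
```

Keeping the retry loop in one helper also gives a single place for the comment explaining why the exception is expected.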

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
##########
@@ -53,6 +53,11 @@ private MinimizationOperations() {}
    *     what to specify.
    */
   public static Automaton minimize(Automaton a, int determinizeWorkLimit) {
+    // nocommit: probably shouldn't set the logic here

Review comment:
       Hmm why did you need this?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];

Review comment:
       Maybe inside here, if the result is `NOT_COMPUTED`, we should go and lazily compute it, instead of expecting the caller to?  And then remove `setNextState` (or make it private)?
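
One possible shape for this suggestion, as a standalone sketch rather than a patch against the PR: `nextState` owns the sentinel check and fills the slot on first use, so callers never observe `NOT_COMPUTED` and no setter is needed. The class `LazyTransitions` and the `compute` callback are hypothetical. The callback stands in for `findDState(step(...))` and is assumed never to return the `NOT_COMPUTED` sentinel itself.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

final class LazyTransitions {
  static final int MISSING = -1;
  private static final int NOT_COMPUTED = -2;

  private final int[] transitions;
  private final IntUnaryOperator compute; // hypothetical stand-in for findDState(step(...))

  LazyTransitions(int numCharClasses, IntUnaryOperator compute) {
    this.transitions = new int[numCharClasses];
    Arrays.fill(transitions, NOT_COMPUTED);
    this.compute = compute;
  }

  /** Returns the cached destination, computing and caching it on first use. */
  int nextState(int charClass) {
    if (transitions[charClass] == NOT_COMPUTED) {
      transitions[charClass] = compute.applyAsInt(charClass);
    }
    return transitions[charClass];
  }

  public static void main(String[] args) {
    int[] calls = {0};
    LazyTransitions t =
        new LazyTransitions(
            3,
            cls -> {
              calls[0]++;
              return cls == 2 ? MISSING : cls + 1;
            });
    t.nextState(0);
    t.nextState(0); // second lookup is served from the cache
    System.out.println("compute calls: " + calls[0]);
  }
}
```

A `MISSING` result is cached like any other destination, so a dead transition is also computed only once.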

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {

Review comment:
       Does this represent the same thing as `FrozenIntSet`?  I.e. it is an element of the powerset: a subset of the NFA states that logically represents one state in the DFA?  But we couldn't re-use the `FrozenIntSet`?
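
For reference, a small sketch of what any such key type has to provide to stand for one DFA state: value equality over the set of NFA states, a stable hash, and a lookup from the set to a dense ordinal. `SubsetKey` and `SubsetInterner` are hypothetical names for this illustration and are not the API of `FrozenIntSet` or of the PR's `DState`. The hash here is the plain `Arrays.hashCode` over the sorted states rather than the `BitMixer`-based sum in the diff.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Immutable set of NFA states with value-based equality (hypothetical helper). */
final class SubsetKey {
  private final int[] sortedStates;
  private final int hash;

  SubsetKey(int[] nfaStates) {
    sortedStates = nfaStates.clone();
    Arrays.sort(sortedStates); // canonical order, so {2, 0} and {0, 2} are the same key
    hash = Arrays.hashCode(sortedStates);
  }

  @Override
  public int hashCode() {
    return hash;
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof SubsetKey && Arrays.equals(sortedStates, ((SubsetKey) o).sortedStates);
  }
}

/** Maps each distinct subset to a dense ordinal, in the spirit of findDState. */
final class SubsetInterner {
  private final Map<SubsetKey, Integer> ords = new HashMap<>();

  /** Returns the existing ordinal for this subset, or assigns the next free one. */
  int ordinalOf(int... nfaStates) {
    SubsetKey key = new SubsetKey(nfaStates);
    Integer ord = ords.get(key);
    if (ord == null) {
      ord = ords.size();
      ords.put(key, ord);
    }
    return ord;
  }
}
```

What `DState` adds on top of such a key in the diff is per-state data (the cached `transitions` table and the `isAccept` flag), which may be the reason it is a separate class.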

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {

Review comment:
       Can we somehow use the version in `RunAutomaton` directly?
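
For reference, the lookup in question is a binary search over the sorted interval start points. Below is a standalone sketch of that search, mirroring the loop quoted elsewhere in this patch; the class name `CharClassSketch` and the sample `points` array in the usage note are made up for illustration.

```java
public class CharClassSketch {

  /**
   * Returns the index of the interval containing c, where points[i] is the first
   * code point of interval i and points is sorted ascending with points[0] <= c.
   */
  public static int getCharClass(int[] points, int c) {
    int a = 0;
    int b = points.length;
    while (b - a > 1) {
      int d = (a + b) >>> 1; // midpoint without overflow
      if (points[d] > c) {
        b = d;
      } else if (points[d] < c) {
        a = d;
      } else {
        return d; // c is exactly an interval start
      }
    }
    return a;
  }
}
```

With `points = {0, 'a', 'b', '{'}`, the code point `'c'` falls into class 2, because the interval starting at `'b'` runs up to `'z'`. Since the search depends only on `points`, it looks like a candidate for a shared static helper, assuming both classes keep the same `points` layout.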

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());

Review comment:
       We can't just use `StateSet.freeze` here?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {

Review comment:
       Could we share this with `RunAutomaton`?
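
One possible shape for such sharing, purely as a sketch: a small interface exposing `step` and `isAccept`, with `run` written once as a default method. `Stepper` and `AbMatcher` are hypothetical names for illustration and do not exist in Lucene; whether `RunAutomaton` could adopt such an interface is an open design question.

```java
public class SharedRunSketch {

  /** Hypothetical common surface for DFA-backed and NFA-backed runners. */
  public interface Stepper {
    int MISSING = -1;

    /** Returns the next state, or MISSING if there is no transition on c. */
    int step(int state, int c);

    boolean isAccept(int state);

    /** Shared run loop: starts at state 0 and consumes every code point. */
    default boolean run(int[] codePoints) {
      int p = 0;
      for (int c : codePoints) {
        p = step(p, c);
        if (p == MISSING) {
          return false;
        }
      }
      return isAccept(p);
    }
  }

  /** Toy automaton accepting exactly "ab": 0 -a-> 1 -b-> 2 (accept). */
  public static final class AbMatcher implements Stepper {
    @Override
    public int step(int state, int c) {
      if (state == 0 && c == 'a') return 1;
      if (state == 1 && c == 'b') return 2;
      return MISSING;
    }

    @Override
    public boolean isAccept(int state) {
      return state == 2;
    }
  }
}
```

Usage would look like `new SharedRunSketch.AbMatcher().run("ab".codePoints().toArray())`.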

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {

Review comment:
       This is perhaps quite costly in some cases?  We are having to step through all transitions for every NFA state in this `DState`, scanning to find ones that include the character we are building a transition for.  But I don't see any better way -- `Automaton` doesn't have any inverted index to map a transition character to the matching transitions for a given state.  And in practice this is likely fine: most NFA states do not have so many outgoing transitions. Though there are surely adversaries ...
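
To illustrate where the cost comes from, here is a toy model of the same scan outside Lucene. Transitions are plain `{min, max, dest}` triples, and the method reports how many transitions it had to visit; `TransitionScanSketch` and its array layout are assumptions made for this example only.

```java
import java.util.TreeSet;

public class TransitionScanSketch {

  /**
   * Computes the set of NFA states reachable from nfaStates on code point c.
   * transitions[s] holds {min, max, dest} triples for state s (toy layout).
   * scannedOut[0] receives the number of transitions visited.
   */
  public static int[] step(int[][][] transitions, int[] nfaStates, int c, int[] scannedOut) {
    TreeSet<Integer> next = new TreeSet<>();
    int scanned = 0;
    for (int s : nfaStates) {
      for (int[] t : transitions[s]) {
        scanned++; // every outgoing transition is inspected, matching or not
        if (t[0] <= c && c <= t[1]) {
          next.add(t[2]);
        }
      }
    }
    scannedOut[0] = scanned;
    return next.stream().mapToInt(Integer::intValue).toArray();
  }
}
```

The work per uncached step is the total number of outgoing transitions across the subset, which is why a state set with many wide fan-out states is the adversarial case. Caching the result per character class means this cost is paid at most once per (DFA state, class) pair.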

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton dfa = regExp.toAutomaton();

Review comment:
       Hmm should we rename this existing method to `toDFA()` maybe?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];
+    }
+
+    private void setNextState(int charClass, int nextState) {
+      initTransitions();
+      assert charClass < transitions.length;
+      transitions[charClass] = nextState;

Review comment:
       Maybe also `assert transitions[charClass] == NOT_COMPUTED` so we confirm we are not double-computing transitions or something?
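
A standalone sketch of the guard being suggested: a lazily filled transition table that refuses to overwrite an entry that was already computed. The class name `LazyTransitionsSketch` is hypothetical, and the sketch throws an exception rather than using `assert` only so that the check does not depend on the JVM running with `-ea`; in the patch itself an `assert` would be the natural form.

```java
import java.util.Arrays;

public class LazyTransitionsSketch {
  public static final int MISSING = -1;
  public static final int NOT_COMPUTED = -2;

  private final int[] transitions;

  public LazyTransitionsSketch(int numCharClasses) {
    transitions = new int[numCharClasses];
    Arrays.fill(transitions, NOT_COMPUTED); // nothing cached yet
  }

  public int nextState(int charClass) {
    return transitions[charClass];
  }

  /** Caches the next state for charClass; each entry may be written only once. */
  public void setNextState(int charClass, int nextState) {
    if (transitions[charClass] != NOT_COMPUTED) {
      throw new IllegalStateException("transition for class " + charClass + " already computed");
    }
    transitions[charClass] = nextState;
  }
}
```

Note that `MISSING` is a legitimate cached value (a dead transition), so the guard compares against `NOT_COMPUTED` specifically.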

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {

Review comment:
       Hmm why would `dState` ever be null like that?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;

Review comment:
       Maybe add a comment that this is lazily initialized the first time a caller wants the transition for a given character?
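
A minimal sketch of the lazy initialization this comment refers to, assuming the table holds one slot per character class and is filled with a "not computed" sentinel on first use. The quoted diff shows `nextState` and `setNextState` calling `initTransitions()`, but not its body, so the body below and the `LazyTransitionsSketch` and `isAllocated` names are assumptions.

```java
import java.util.Arrays;

/** Toy illustration only; the allocation policy shown here is an assumption. */
public class LazyTransitionsSketch {
  public static final int NOT_COMPUTED = -2;

  private final int numCharClasses;

  // Lazily initialized: stays null until the first lookup or update for this state.
  private int[] transitions;

  public LazyTransitionsSketch(int numCharClasses) {
    this.numCharClasses = numCharClasses;
  }

  private void initTransitions() {
    if (transitions == null) {
      transitions = new int[numCharClasses];
      Arrays.fill(transitions, NOT_COMPUTED);
    }
  }

  public int nextState(int charClass) {
    initTransitions();
    return transitions[charClass];
  }

  public void setNextState(int charClass, int nextState) {
    initTransitions();
    transitions[charClass] = nextState;
  }

  /** Exposed only so the laziness can be observed in this sketch. */
  public boolean isAllocated() {
    return transitions != null;
  }
}
```

The point of deferring the allocation is that states which are created but never stepped from do not pay for a transition table.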




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-981447584


   I made a quick prototype of what I mean for the API: https://github.com/apache/lucene/pull/485
   
   The idea is that AutomatonQuery shouldn't be determinizing. Let's push this to the caller. If they pass it a DFA, it uses the DFA algorithm. If they pass it an NFA, it can use the NFA algorithm (it currently throws an exception in my branch instead of slowly determinizing; that is the change).
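
The dispatch described above could be sketched roughly as follows. This is a self-contained toy under stated assumptions: `CallerChoosesSketch`, `ToyAutomaton`, `Algorithm`, and the `nfaSupported` flag are hypothetical names for illustration and do not reflect the actual code in either pull request.

```java
/** Toy illustration only; none of these types are Lucene API. */
public class CallerChoosesSketch {
  public enum Algorithm {
    DFA,
    NFA
  }

  /** Stand-in for an automaton; it only records whether it is deterministic. */
  public static final class ToyAutomaton {
    public final boolean deterministic;

    public ToyAutomaton(boolean deterministic) {
      this.deterministic = deterministic;
    }
  }

  /**
   * Picks the matching algorithm from what the caller passed in. The query never
   * determinizes on its own: a non-deterministic input either runs on the NFA
   * path or is rejected when that path is not available.
   */
  public static Algorithm select(ToyAutomaton automaton, boolean nfaSupported) {
    if (automaton.deterministic) {
      return Algorithm.DFA;
    }
    if (!nfaSupported) {
      throw new IllegalArgumentException(
          "input automaton is not deterministic; determinize it before passing it in");
    }
    return Algorithm.NFA;
  }
}
```

With this shape, the cost of determinization is visible at the call site, since a caller who wants the DFA path has to determinize explicitly.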



[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r725016724



##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       Yeah +1 to pursue that separately/later.





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r724698226



##########
File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
##########
@@ -187,6 +187,14 @@ public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throw
     if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
       throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
     }
+    if (compiled.nfaRunAutomaton != null) {
+      return new IntersectTermsEnum(

Review comment:
       Hmmm? I don't understand this comment?





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677669213



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space

Review comment:
       ++

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton dfa = regExp.toAutomaton();

Review comment:
       Yeah I think that's better, will change it.

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  public void testRandom() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);

Review comment:
       Yeah, ecj is complaining about that :( I haven't dug very deep into why `randomRegexp` generates an invalid `RegExp`; maybe I'll just open another issue to fix that behavior?

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;

Review comment:
       ++

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];
+    }
+
+    private void setNextState(int charClass, int nextState) {
+      initTransitions();
+      assert charClass < transitions.length;
+      transitions[charClass] = nextState;

Review comment:
       ++

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {

Review comment:
       It's slightly different: `DState` owns a `transitions` array that manages its outgoing transitions, which `FrozenIntSet` doesn't have. I'm thinking of making `FrozenIntSet` a part of this class so that we can reuse what's there.
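
To make the difference concrete, here is a minimal sketch of a DFA state that pairs a frozen set of NFA states with a cached `transitions` array. It is not the PR's code: the class name `LazyDState`, the `numCharClasses` parameter, and the eager allocation of the array are assumptions made for illustration, and only the two sentinel values mirror the quoted diff.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the PR's DState: a frozen NFA-state set plus cached transitions. */
final class LazyDState {
  static final int MISSING = -1;      // the transition does not exist
  static final int NOT_COMPUTED = -2; // the transition has not been determinized yet

  private final int[] nfaStates;   // sorted copy, never mutated after construction
  private final int[] transitions; // next DFA state ordinal, indexed by character class
  private final int hashCode;      // precomputed because the state set is immutable

  LazyDState(int[] nfaStates, int numCharClasses) {
    this.nfaStates = nfaStates.clone();
    Arrays.sort(this.nfaStates);
    this.transitions = new int[numCharClasses];
    Arrays.fill(this.transitions, NOT_COMPUTED);
    this.hashCode = Arrays.hashCode(this.nfaStates);
  }

  int nextState(int charClass) {
    return transitions[charClass];
  }

  void setNextState(int charClass, int ord) {
    transitions[charClass] = ord;
  }

  // Identity is defined by the NFA state set only; the cached transitions are not part of it.
  @Override
  public boolean equals(Object o) {
    return o instanceof LazyDState && Arrays.equals(nfaStates, ((LazyDState) o).nfaStates);
  }

  @Override
  public int hashCode() {
    return hashCode;
  }
}
```

In this sketch the set part (array, hash code, equality) is what a `FrozenIntSet`-like type would already provide, while the `transitions` cache is the extra per-state data.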

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {

Review comment:
       Yeah, we need to refactor `RunAutomaton` a little bit; eventually we should use it directly instead of forking the code.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/MinimizationOperations.java
##########
@@ -53,6 +53,11 @@ private MinimizationOperations() {}
    *     what to specify.
    */
   public static Automaton minimize(Automaton a, int determinizeWorkLimit) {
+    // nocommit: probably shouldn't set the logic here

Review comment:
       This is just a quick way to let me generate an NFA instead of a DFA from the `RegExp`. I'm not sure whether setting `determinizeWorkLimit` to 0 is a good way to indicate that we want an NFA.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {

Review comment:
       Probably yes and probably no? In the current `RunAutomaton` hierarchy, `RunAutomaton` has only the `step` method while the children own their own `run` methods. I haven't figured out a good way to incorporate this new NFA into the current structure.
   
   One idea is to let the children, such as `ByteRunAutomaton`, not extend `RunAutomaton` but use it instead, so that they can control which `RunAutomaton` to use (the NFA or the DFA version).
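
The composition idea above could look roughly like the following sketch. Every name here (`Stepper`, `ByteRunner`, `OneOrMoreA`) is hypothetical and none of it is Lucene's actual class hierarchy; the point is only that the byte-oriented runner holds a stepping object instead of inheriting from one.

```java
/** Hypothetical abstraction: anything that can step, whether backed by a DFA table or a lazy NFA. */
interface Stepper {
  int MISSING = -1;

  int step(int state, int c);

  boolean isAccept(int state);
}

/** Hypothetical byte-oriented runner that uses a Stepper rather than extending one. */
final class ByteRunner {
  private final Stepper stepper;

  ByteRunner(Stepper stepper) {
    this.stepper = stepper;
  }

  boolean run(byte[] s, int offset, int length) {
    int p = 0; // state 0 is assumed to be the initial state
    for (int i = offset; i < offset + length; i++) {
      p = stepper.step(p, s[i] & 0xFF); // treat bytes as unsigned labels
      if (p == Stepper.MISSING) {
        return false;
      }
    }
    return stepper.isAccept(p);
  }
}

/** Toy stepper accepting one or more 'a' bytes; it exists only to exercise the sketch. */
final class OneOrMoreA implements Stepper {
  @Override
  public int step(int state, int c) {
    return c == 'a' ? 1 : MISSING;
  }

  @Override
  public boolean isAccept(int state) {
    return state == 1;
  }
}
```

With this shape, swapping the DFA-backed implementation for an NFA-backed one would only change which `Stepper` is passed to the constructor.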

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      return transitions[charClass];

Review comment:
       Good idea, will do




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mikemccand commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-974213961


   > Thanks @dweiss seems this is not the first time we see this error: https://issues.apache.org/jira/browse/LUCENE-9839
   
   Looks like this is (scarily!) pre-existing.  I don't think it should block this change, and I cannot see anything here that might cause that scary exception.



[GitHub] [lucene] rmuir commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

rmuir commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-888270205


   > This is super exciting! I'm amazed how little code you needed to get this first version running.
   
   but a runautomaton for this won't run any queries on its own: brute force isn't how these queries actually work. the important part is the intersection (skipping around)...
   
   I suggest, please let's not try to "overshare" and refactor all this stuff alongside DFA stuff until there is a query we can actually benchmark to see if the performance is even viable



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r677666989



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,225 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA
+ * state along with the run
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+
+  /**
+   * Constructor, assuming alphabet size is the whole codepoint space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link #step(int[], int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    if (dState.nextState(charClass) == NOT_COMPUTED) {
+      // the next dfa state has not been computed yet
+      dState.setNextState(charClass, findDState(step(dState.nfaStates, c)));
+    }
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * given a list of NFA states and a character c, compute the output list of NFA state which is
+   * wrapped as a DFA state
+   */
+  private DState step(int[] nfaStates, int c) {
+    Transition transition = new Transition();
+    StateSet stateSet = new StateSet(5); // fork IntHashSet from hppc instead?
+    int numTransitions;
+    for (int nfaState : nfaStates) {
+      numTransitions = automaton.initTransition(nfaState, transition);
+      for (int i = 0; i < numTransitions; i++) {
+        automaton.getNextTransition(transition);
+        if (transition.min <= c && transition.max >= c) {
+          stateSet.incr(transition.dest);
+        }
+      }
+    }
+    if (stateSet.size() == 0) {
+      return null;
+    }
+    return new DState(stateSet.getArray());
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {

Review comment:
       So when we try to step into a transition that doesn't exist, `step(int[] nfaStates, int c)` will return `null`, which is then passed to here.
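
The flow described in this comment can be illustrated with a toy on-the-fly subset construction. This is a sketch under simplifying assumptions, not the PR's implementation: the class `LazySubsetRunner`, the `{from, label, to}` edge encoding, and the use of `List<Integer>` for state sets are all invented for the example, and unlike the PR it does not cache transitions per character class.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy lazy determinization; not Lucene's API. */
final class LazySubsetRunner {
  static final int MISSING = -1;

  private final int[][] edges; // each edge is {from, label, to}
  private final Map<List<Integer>, Integer> ordOf = new HashMap<>();
  private final List<List<Integer>> sets = new ArrayList<>();

  LazySubsetRunner(int[][] edges) {
    this.edges = edges;
    find(List.of(0)); // DFA state 0 is the set containing only NFA state 0
  }

  /** Returns the sorted set of NFA states reachable on c, or null when there is none. */
  private List<Integer> stepSet(List<Integer> from, int c) {
    TreeSet<Integer> dest = new TreeSet<>();
    for (int[] e : edges) {
      if (e[1] == c && from.contains(e[0])) {
        dest.add(e[2]);
      }
    }
    return dest.isEmpty() ? null : new ArrayList<>(dest);
  }

  /** Maps a state set to its ordinal, assigning a new one if unseen; null means no transition. */
  private int find(List<Integer> set) {
    if (set == null) {
      return MISSING;
    }
    Integer ord = ordOf.get(set);
    if (ord != null) {
      return ord;
    }
    ordOf.put(set, sets.size());
    sets.add(set);
    return sets.size() - 1;
  }

  int step(int state, int c) {
    return find(stepSet(sets.get(state), c));
  }

  int numStates() {
    return sets.size();
  }
}
```

Handling `null` inside `find` keeps `step` a single expression: the caller never has to distinguish "no destination states" from "a known or new DFA state".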





[GitHub] [lucene] zhaih merged pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih merged pull request #225:
URL: https://github.com/apache/lucene/pull/225


   



[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-985053769


   @rmuir I've done the benchmark (sorry for the delay), and the results look flat, which is good for this change:
   ```
                       Task    QPS base      StdDev    QPS cand      StdDev                Pct diff p-value
               OrNotHighMed      523.45      (3.0%)      518.00      (1.7%)   -1.0% (  -5% -    3%) 0.176
                   HighTerm     1385.51      (1.9%)     1374.10      (2.5%)   -0.8% (  -5% -    3%) 0.232
                     Fuzzy2       35.19      (5.9%)       34.91      (5.6%)   -0.8% ( -11% -   11%) 0.666
      HighTermDayOfYearSort      779.01      (2.5%)      772.93      (2.4%)   -0.8% (  -5% -    4%) 0.315
                    Respell       36.33      (1.3%)       36.05      (1.6%)   -0.8% (  -3% -    2%) 0.086
           HighSloppyPhrase        5.08      (6.2%)        5.04      (6.1%)   -0.8% ( -12% -   12%) 0.700
                 AndHighLow      349.64      (3.1%)      347.17      (3.5%)   -0.7% (  -7% -    6%) 0.501
          HighTermMonthSort       37.82      (7.0%)       37.55      (6.9%)   -0.7% ( -13% -   14%) 0.751
                    LowTerm     1041.47      (2.5%)     1034.72      (2.1%)   -0.6% (  -5% -    3%) 0.367
               OrNotHighLow      558.55      (2.0%)      555.05      (2.3%)   -0.6% (  -4% -    3%) 0.358
                 TermDTSort      149.25      (3.7%)      148.36      (3.3%)   -0.6% (  -7% -    6%) 0.592
            LowSloppyPhrase       46.63      (3.7%)       46.37      (3.9%)   -0.6% (  -7% -    7%) 0.647
              OrHighNotHigh      454.25      (1.6%)      452.05      (1.6%)   -0.5% (  -3% -    2%) 0.341
            MedSloppyPhrase       22.73      (2.4%)       22.63      (3.0%)   -0.4% (  -5% -    5%) 0.627
                AndHighHigh       20.19      (2.4%)       20.11      (2.0%)   -0.4% (  -4% -    4%) 0.564
                 AndHighMed       42.22      (2.1%)       42.05      (2.0%)   -0.4% (  -4% -    3%) 0.549
               OrHighNotLow      578.56      (1.7%)      576.35      (2.8%)   -0.4% (  -4% -    4%) 0.599
       HighIntervalsOrdered       11.48      (4.0%)       11.44      (3.9%)   -0.4% (  -7% -    7%) 0.759
       HighTermTitleBDVSort        9.27      (2.8%)        9.23      (2.7%)   -0.4% (  -5% -    5%) 0.664
                    MedTerm      963.30      (2.1%)      959.98      (1.5%)   -0.3% (  -3% -    3%) 0.544
                   Wildcard       83.18      (4.2%)       82.91      (4.4%)   -0.3% (  -8% -    8%) 0.811
        LowIntervalsOrdered       11.41      (3.2%)       11.38      (2.8%)   -0.3% (  -6% -    5%) 0.768
        MedIntervalsOrdered       16.77      (2.6%)       16.72      (2.3%)   -0.3% (  -5% -    4%) 0.722
               HighSpanNear        1.06      (1.0%)        1.05      (1.3%)   -0.2% (  -2% -    2%) 0.506
                LowSpanNear        3.06      (1.1%)        3.05      (1.4%)   -0.2% (  -2% -    2%) 0.599
   BrowseDayOfYearSSDVFacets        2.88      (2.5%)        2.87      (2.6%)   -0.2% (  -5% -    5%) 0.827
              OrNotHighHigh      466.02      (2.3%)      465.36      (1.9%)   -0.1% (  -4% -    4%) 0.828
       BrowseDateTaxoFacets        0.84      (5.3%)        0.84      (4.4%)   -0.1% (  -9% -   10%) 0.933
                     IntNRQ       28.10      (0.7%)       28.08      (1.2%)   -0.1% (  -1% -    1%) 0.762
   BrowseDayOfYearTaxoFacets        0.84      (5.3%)        0.84      (4.4%)   -0.1% (  -9% -   10%) 0.957
      BrowseMonthSSDVFacets        2.99      (3.5%)        2.99      (3.7%)   -0.1% (  -7% -    7%) 0.962
      BrowseMonthTaxoFacets        0.87      (5.6%)        0.87      (4.7%)   -0.0% (  -9% -   10%) 0.979
                  OrHighLow      373.02      (3.0%)      372.89      (2.7%)   -0.0% (  -5% -    5%) 0.968
                   PKLookup      131.84      (2.4%)      131.83      (2.9%)   -0.0% (  -5% -    5%) 0.992
                    Prefix3       93.52      (3.2%)       93.51      (3.5%)   -0.0% (  -6% -    6%) 0.997
                 OrHighHigh       30.69      (2.1%)       30.69      (2.1%)    0.0% (  -4% -    4%) 0.989
                MedSpanNear       23.34      (1.3%)       23.35      (1.2%)    0.0% (  -2% -    2%) 0.958
                     Fuzzy1       44.19      (7.7%)       44.20      (8.2%)    0.0% ( -14% -   17%) 0.991
                 HighPhrase       89.39      (2.0%)       89.43      (1.9%)    0.0% (  -3% -    4%) 0.943
                  LowPhrase       27.84      (2.0%)       27.88      (2.6%)    0.1% (  -4% -    4%) 0.853
                  OrHighMed       31.74      (1.9%)       31.80      (1.8%)    0.2% (  -3% -    3%) 0.763
                  MedPhrase       82.20      (1.7%)       82.54      (2.4%)    0.4% (  -3% -    4%) 0.520
               OrHighNotMed      501.51      (1.7%)      505.78      (2.3%)    0.9% (  -3% -    4%) 0.181
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] sonatype-lift[bot] commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

sonatype-lift[bot] commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r762178744



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA. It will lazily determinize on-demand, memorizing the
+ * generated DFA states that has been explored
+ *
+ * <p>implemented based on: https://swtch.com/~rsc/regexp/regexp1.html
+ */
+public class NFARunAutomaton implements ByteRunnable, TransitionAccessor {
+
+  /** state ordinal of "no such state" */
+  public static final int MISSING = -1;
+
+  private static final int NOT_COMPUTED = -2;
+
+  private final Automaton automaton;
+  private final int[] points;
+  private final Map<DState, Integer> dStateToOrd = new HashMap<>(); // could init lazily?
+  private DState[] dStates;
+  private final int alphabetSize;
+  final int[] classmap; // map from char number to class
+
+  private final Operations.PointTransitionSet transitionSet =
+      new Operations.PointTransitionSet(); // reusable
+  private final StateSet statesSet = new StateSet(5); // reusable
+
+  /**
+   * Constructor, assuming alphabet size is the whole Unicode code point space
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} for
+   *     better efficiency
+   */
+  public NFARunAutomaton(Automaton automaton) {
+    this(automaton, Character.MAX_CODE_POINT);
+  }
+
+  /**
+   * Constructor
+   *
+   * @param automaton incoming automaton, should be NFA, for DFA please use {@link RunAutomaton} *
+   *     for better efficiency
+   * @param alphabetSize alphabet size
+   */
+  public NFARunAutomaton(Automaton automaton, int alphabetSize) {
+    this.automaton = automaton;
+    points = automaton.getStartPoints();
+    this.alphabetSize = alphabetSize;
+    dStates = new DState[10];
+    findDState(new DState(new int[] {0}));
+
+    /*
+     * Set alphabet table for optimal run performance.
+     */
+    classmap = new int[Math.min(256, alphabetSize)];
+    int i = 0;
+    for (int j = 0; j < classmap.length; j++) {
+      if (i + 1 < points.length && j == points[i + 1]) {
+        i++;
+      }
+      classmap[j] = i;
+    }
+  }
+
+  /**
+   * For a given state and an incoming character (codepoint), return the next state
+   *
+   * @param state incoming state, should either be 0 or some state that is returned previously by
+   *     this function
+   * @param c codepoint
+   * @return the next state or {@link #MISSING} if the transition doesn't exist
+   */
+  @Override
+  public int step(int state, int c) {
+    assert dStates[state] != null;
+    return step(dStates[state], c);
+  }
+
+  @Override
+  public boolean isAccept(int state) {
+    assert dStates[state] != null;
+    return dStates[state].isAccept;
+  }
+
+  @Override
+  public int getSize() {
+    return dStates.length;
+  }
+
+  /**
+   * Run through a given codepoint array, return accepted or not, should only be used in test
+   *
+   * @param s String represented by an int array
+   * @return accept or not
+   */
+  boolean run(int[] s) {
+    int p = 0;
+    for (int c : s) {
+      p = step(p, c);
+      if (p == MISSING) return false;
+    }
+    return dStates[p].isAccept;
+  }
+
+  /**
+   * From an existing DFA state, step to next DFA state given character c if the transition is
+   * previously tried then this operation will just use the cached result, otherwise it will call
+   * {@link DState#step(int)} to get the next state and cache the result
+   */
+  private int step(DState dState, int c) {
+    int charClass = getCharClass(c);
+    return dState.nextState(charClass);
+  }
+
+  /**
+   * return the ordinal of given DFA state, generate a new ordinal if the given DFA state is a new
+   * one
+   */
+  private int findDState(DState dState) {
+    if (dState == null) {
+      return MISSING;
+    }
+    int ord = dStateToOrd.getOrDefault(dState, -1);
+    if (ord >= 0) {
+      return ord;
+    }
+    ord = dStateToOrd.size();
+    dStateToOrd.put(dState, ord);
+    assert ord >= dStates.length || dStates[ord] == null;
+    if (ord >= dStates.length) {
+      dStates = ArrayUtil.grow(dStates, ord + 1);
+    }
+    dStates[ord] = dState;
+    return ord;
+  }
+
+  /** Gets character class of given codepoint */
+  final int getCharClass(int c) {
+    assert c < alphabetSize;
+
+    if (c < classmap.length) {
+      return classmap[c];
+    }
+
+    // binary search
+    int a = 0;
+    int b = points.length;
+    while (b - a > 1) {
+      int d = (a + b) >>> 1;
+      if (points[d] > c) b = d;
+      else if (points[d] < c) a = d;
+      else return d;
+    }
+    return a;
+  }
+
+  @Override
+  public int initTransition(int state, Transition t) {
+    t.source = state;
+    t.transitionUpto = -1;
+    return getNumTransitions(state);
+  }
+
+  @Override
+  public void getNextTransition(Transition t) {
+    assert t.transitionUpto < points.length - 1 && t.transitionUpto >= -1;
+    while (dStates[t.source].transitions[++t.transitionUpto] == MISSING) {
+      // this shouldn't throw AIOOBE as long as this function is only called
+      // numTransitions times
+    }
+    assert dStates[t.source].transitions[t.transitionUpto] != NOT_COMPUTED;
+    t.dest = dStates[t.source].transitions[t.transitionUpto];
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  @Override
+  public int getNumTransitions(int state) {
+    dStates[state].determinize();
+    return dStates[state].outgoingTransitions;
+  }
+
+  @Override
+  public void getTransition(int state, int index, Transition t) {
+    dStates[state].determinize();
+    int outgoingTransitions = -1;
+    t.transitionUpto = -1;
+    t.source = state;
+    while (outgoingTransitions < index && t.transitionUpto < points.length - 1) {
+      if (dStates[t.source].transitions[++t.transitionUpto] != MISSING) {
+        outgoingTransitions++;
+      }
+    }
+    assert outgoingTransitions == index;
+
+    t.min = points[t.transitionUpto];
+    if (t.transitionUpto == points.length - 1) {
+      t.max = alphabetSize - 1;
+    } else {
+      t.max = points[t.transitionUpto + 1] - 1;
+    }
+  }
+
+  private class DState {
+    private final int[] nfaStates;
+    // this field is lazily init'd when first time caller wants to add a new transition
+    private int[] transitions;
+    private final int hashCode;
+    private final boolean isAccept;
+    private final Transition stepTransition = new Transition();
+    private Transition minimalTransition;
+    private int computedTransitions;
+    private int outgoingTransitions;
+
+    private DState(int[] nfaStates) {
+      assert nfaStates != null && nfaStates.length > 0;
+      this.nfaStates = nfaStates;
+      int hashCode = nfaStates.length;
+      boolean isAccept = false;
+      for (int s : nfaStates) {
+        hashCode += BitMixer.mix(s);
+        if (automaton.isAccept(s)) {
+          isAccept = true;
+        }
+      }
+      this.isAccept = isAccept;
+      this.hashCode = hashCode;
+    }
+
+    private int nextState(int charClass) {
+      initTransitions();
+      assert charClass < transitions.length;
+      if (transitions[charClass] == NOT_COMPUTED) {
+        assignTransition(charClass, findDState(step(points[charClass])));
+        // we could potentially update more than one char classes
+        if (minimalTransition != null) {
+          // to the left
+          int cls = charClass;
+          while (cls > 0 && points[--cls] >= minimalTransition.min) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          // to the right
+          cls = charClass;
+          while (cls < points.length - 1 && points[++cls] <= minimalTransition.max) {
+            assert transitions[cls] == NOT_COMPUTED || transitions[cls] == transitions[charClass];
+            assignTransition(cls, transitions[charClass]);
+          }
+          minimalTransition = null;
+        }
+      }
+      return transitions[charClass];
+    }
+
+    private void assignTransition(int charClass, int dest) {
+      if (transitions[charClass] == NOT_COMPUTED) {
+        computedTransitions++;
+        transitions[charClass] = dest;
+        if (transitions[charClass] != MISSING) {
+          outgoingTransitions++;
+        }
+      }
+    }
+
+    /**
+     * given a list of NFA states and a character c, compute the output list of NFA state which is
+     * wrapped as a DFA state
+     */
+    private DState step(int c) {
+      statesSet.reset(); // TODO: fork IntHashSet from hppc instead?
+      int numTransitions;
+      int left = -1, right = alphabetSize;
+      for (int nfaState : nfaStates) {
+        numTransitions = automaton.initTransition(nfaState, stepTransition);
+        // TODO: binary search should be faster, since transitions are sorted
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          if (stepTransition.min <= c && stepTransition.max >= c) {
+            statesSet.incr(stepTransition.dest);
+            left = Math.max(stepTransition.min, left);
+            right = Math.min(stepTransition.max, right);
+          }
+          if (stepTransition.max < c) {
+            left = Math.max(stepTransition.max + 1, left);
+          }
+          if (stepTransition.min > c) {
+            right = Math.min(stepTransition.min - 1, right);
+            // transitions in automaton are sorted
+            break;
+          }
+        }
+      }
+      if (statesSet.size() == 0) {
+        return null;
+      }
+      minimalTransition = new Transition();
+      minimalTransition.min = left;
+      minimalTransition.max = right;
+      return new DState(statesSet.getArray());
+    }
+
+    // determinize this state only
+    private void determinize() {
+      if (transitions != null && computedTransitions == transitions.length) {
+        // already determinized
+        return;
+      }
+      initTransitions();
+      // Mostly forked from Operations.determinize
+      transitionSet.reset();
+      for (int nfaState : nfaStates) {
+        int numTransitions = automaton.initTransition(nfaState, stepTransition);
+        for (int i = 0; i < numTransitions; i++) {
+          automaton.getNextTransition(stepTransition);
+          transitionSet.add(stepTransition);
+        }
+      }
+      if (transitionSet.count == 0) {
+        // no outgoing transitions
+        Arrays.fill(transitions, MISSING);
+        computedTransitions = transitions.length;
+        return;
+      }
+
+      transitionSet
+          .sort(); // TODO: could use a PQ (heap) instead, since transitions for each state are
+      // sorted
+      statesSet.reset();
+      int lastPoint = -1;
+      int charClass = 0;
+      for (int i = 0; i < transitionSet.count; i++) {
+        final int point = transitionSet.points[i].point;
+        if (statesSet.size() > 0) {
+          assert lastPoint != -1;
+          int ord = findDState(new DState(statesSet.getArray()));
+          while (points[charClass] < lastPoint) {
+            assignTransition(charClass++, MISSING);
+          }
+          assert points[charClass] == lastPoint;
+          while (charClass < points.length && points[charClass] < point) {
+            assert transitions[charClass] == NOT_COMPUTED || transitions[charClass] == ord;
+            assignTransition(charClass++, ord);
+          }
+          assert (charClass == points.length && point == alphabetSize)
+              || points[charClass] == point;
+        }
+
+        // process transitions that end on this point
+        // (closes an overlapping interval)
+        int[] transitions = transitionSet.points[i].ends.transitions;
+        int limit = transitionSet.points[i].ends.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.decr(dest);
+        }
+        transitionSet.points[i].ends.next = 0;
+
+        // process transitions that start on this point
+        // (opens a new interval)
+        transitions = transitionSet.points[i].starts.transitions;
+        limit = transitionSet.points[i].starts.next;
+        for (int j = 0; j < limit; j += 3) {
+          int dest = transitions[j];
+          statesSet.incr(dest);
+        }
+
+        lastPoint = point;
+        transitionSet.points[i].starts.next = 0;
+      }
+      assert statesSet.size() == 0;
+      assert computedTransitions
+          >= charClass; // it's also possible that some transitions after the charClass has already
+      // been explored
+      // no more outgoing transitions, set rest of transition to MISSING
+      assert charClass == transitions.length
+          || transitions[charClass] == MISSING
+          || transitions[charClass] == NOT_COMPUTED;
+      Arrays.fill(transitions, charClass, transitions.length, MISSING);
+      computedTransitions = transitions.length;
+    }
+
+    private void initTransitions() {
+      if (transitions == null) {
+        transitions = new int[points.length];
+        Arrays.fill(transitions, NOT_COMPUTED);
+      }
+    }
+
+    @Override
+    public int hashCode() {
+      return hashCode;
+    }
+
+    @Override
+    public boolean equals(Object o) {

Review comment:
       I've recorded this as ignored for this pull request. If you change your mind, just comment `@sonatype-lift unignore`.





[GitHub] [lucene] mikemccand commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

mikemccand commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r715531456



##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/uniformsplit/IntersectBlockReader.java
##########
@@ -384,15 +390,18 @@ protected AutomatonNextTermCalculator(CompiledAutomaton compiled) {
     }
 
     /** Records the given state has been visited. */
-    protected void setVisited(int state) {
+    private void setVisited(int state) {
       if (!finite) {

Review comment:
       Maybe fix to `finite == false` since you are here :)

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##########
@@ -551,12 +551,22 @@ static RegExp newLeafNode(
     return new RegExp(flags, kind, null, null, s, c, min, max, digits, from, to);
   }
 
+  /**
+   * Return an <code>Automaton</code> from this <code>RegExp</code> that will skip the determinize
+   * and minimize step
+   *
+   * @return {@link Automaton} most likely non-deterministic
+   */
+  public Automaton toNFA() {

Review comment:
       I wonder just how "NFA" this Automaton really is.  Like for a simple regexp, what does the NFA even look like?  I know the `RegExp` code makes heavy use of `.addEpsilon` which creates many copies of transitions, etc.
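
To make the question concrete, here is a minimal sketch of how an epsilon edge can be emulated by copying transitions, which is the behaviour the comment attributes to `.addEpsilon`. It is not Lucene's implementation: `EpsilonCopySketch` and all of its members are hypothetical names, and transitions are stored as plain `{min, max, dest}` triples. It builds the union of `ab` and `ac` and shows that the new start state ends up with two overlapping `a` transitions, i.e. it is genuinely non-deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch, not Lucene code: epsilon edges emulated by copying transitions. */
final class EpsilonCopySketch {
  // transitions.get(s) holds {min, max, dest} triples leaving state s
  private final List<List<int[]>> transitions = new ArrayList<>();
  private final List<Boolean> accept = new ArrayList<>();

  int createState() {
    transitions.add(new ArrayList<>());
    accept.add(false);
    return transitions.size() - 1;
  }

  void addTransition(int source, int dest, int min, int max) {
    transitions.get(source).add(new int[] {min, max, dest});
  }

  void setAccept(int state) {
    accept.set(state, true);
  }

  /** Copies every transition leaving dest onto source; source also inherits acceptance. */
  void addEpsilon(int source, int dest) {
    // iterate over a snapshot so that source == dest does not modify the list being read
    for (int[] t : new ArrayList<>(transitions.get(dest))) {
      transitions.get(source).add(t.clone());
    }
    if (accept.get(dest)) {
      accept.set(source, true);
    }
  }

  int numTransitions(int state) {
    return transitions.get(state).size();
  }

  /** A state is deterministic if no two of its label ranges overlap. */
  boolean isDeterministic(int state) {
    List<int[]> ts = transitions.get(state);
    for (int i = 0; i < ts.size(); i++) {
      for (int j = i + 1; j < ts.size(); j++) {
        if (ts.get(i)[0] <= ts.get(j)[1] && ts.get(j)[0] <= ts.get(i)[1]) {
          return false;
        }
      }
    }
    return true;
  }

  /** Builds the union of "ab" and "ac" and returns the automaton; state 0 is the start. */
  static EpsilonCopySketch unionOfAbAndAc() {
    EpsilonCopySketch a = new EpsilonCopySketch();
    int start = a.createState();
    int[] s = new int[6];
    for (int i = 0; i < s.length; i++) {
      s[i] = a.createState();
    }
    a.addTransition(s[0], s[1], 'a', 'a');
    a.addTransition(s[1], s[2], 'b', 'b');
    a.setAccept(s[2]);
    a.addTransition(s[3], s[4], 'a', 'a');
    a.addTransition(s[4], s[5], 'c', 'c');
    a.setAccept(s[5]);
    a.addEpsilon(start, s[0]); // copies the 'a' edge of the first branch
    a.addEpsilon(start, s[3]); // copies the 'a' edge of the second branch
    return a;
  }

  public static void main(String[] args) {
    EpsilonCopySketch a = unionOfAbAndAc();
    System.out.println("start transitions: " + a.numTransitions(0));
    System.out.println("start deterministic: " + a.isDeterministic(0));
  }
}
```

In this toy case the copying leaves no explicit epsilon edges behind, but the overlap on `a` remains, so a run-time subset construction (or an up-front determinize) is still required.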

##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -96,12 +109,35 @@ public AutomatonQuery(final Term term, Automaton automaton, int determinizeWorkL
    */
   public AutomatonQuery(
       final Term term, Automaton automaton, int determinizeWorkLimit, boolean isBinary) {
+    this(term, automaton, determinizeWorkLimit, isBinary, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}.
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param determinizeWorkLimit maximum effort to spend determinizing the automaton. If the
+   *     automaton will need more than this much effort, TooComplexToDeterminizeException is thrown.
+   *     Higher numbers require more space but can process more complex automata.
+   * @param isBinary if true, this automaton is already binary and will not go through the
+   *     UTF32ToUTF8 conversion
+   * @param runnableType NFA or DFA
+   */
+  public AutomatonQuery(

Review comment:
       Cool, so the existing ctor remains, defaulting to `DFA` execution strategy, where the automaton is first fully determinized.
   
   But now you add another ctor, letting users also ask for `NFA` execution, where the automaton is determinized lazily on-demand and only in those parts that the terms in this index need to visit.
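
The "determinized lazily on-demand" idea can be sketched in plain Java as follows. This is an illustration under simplifying assumptions, not the PR's `NFARunAutomaton`: `LazyNfaRunner` is a hypothetical class, edges carry a single label instead of a range, and each DFA state is simply a sorted list of NFA states that is created and cached the first time a run reaches it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical sketch: runs an NFA by creating and caching DFA states only when visited. */
final class LazyNfaRunner {
  static final int MISSING = -1;

  private final int[][] edges; // each edge is {source, label, dest}
  private final boolean[] accept; // accept[s] is true if NFA state s accepts
  private final Map<List<Integer>, Integer> ordOf = new HashMap<>(); // NFA state set -> ordinal
  private final List<List<Integer>> dStates = new ArrayList<>(); // ordinal -> NFA state set
  private final List<Map<Integer, Integer>> cache = new ArrayList<>(); // ordinal -> label -> next

  LazyNfaRunner(int[][] edges, boolean[] accept) {
    this.edges = edges;
    this.accept = accept;
    intern(List.of(0)); // DFA state 0 is the set holding only NFA state 0
  }

  /** Returns the ordinal of the given state set, assigning a new one if it is unseen. */
  private int intern(List<Integer> nfaStates) {
    Integer ord = ordOf.get(nfaStates);
    if (ord != null) {
      return ord;
    }
    int newOrd = dStates.size();
    ordOf.put(nfaStates, newOrd);
    dStates.add(nfaStates);
    cache.add(new HashMap<>());
    return newOrd;
  }

  /** Follows label c from a DFA state, computing the target set only on the first request. */
  int step(int dState, int c) {
    Map<Integer, Integer> row = cache.get(dState);
    Integer cached = row.get(c);
    if (cached != null) {
      return cached;
    }
    TreeSet<Integer> next = new TreeSet<>(); // sorted, so equal sets get equal lists
    for (int[] e : edges) {
      if (e[1] == c && dStates.get(dState).contains(e[0])) {
        next.add(e[2]);
      }
    }
    int dest = next.isEmpty() ? MISSING : intern(new ArrayList<>(next));
    row.put(c, dest);
    return dest;
  }

  boolean run(String s) {
    int p = 0;
    for (int i = 0; i < s.length(); i++) {
      p = step(p, s.charAt(i));
      if (p == MISSING) {
        return false;
      }
    }
    for (int n : dStates.get(p)) {
      if (accept[n]) {
        return true;
      }
    }
    return false;
  }

  /** Number of DFA states materialized so far. */
  int exploredStates() {
    return dStates.size();
  }

  public static void main(String[] args) {
    // NFA for (a|b)*abb: state 0 loops on a and b, then a -> 1, b -> 2, b -> 3 (accept)
    int[][] edges = {{0, 'a', 0}, {0, 'b', 0}, {0, 'a', 1}, {1, 'b', 2}, {2, 'b', 3}};
    LazyNfaRunner r = new LazyNfaRunner(edges, new boolean[] {false, false, false, true});
    System.out.println(r.run("a") + " after exploring " + r.exploredStates() + " DFA states");
    System.out.println(r.run("abb") + " after exploring " + r.exploredStates() + " DFA states");
  }
}
```

The point of the design is that the cost of building DFA states is driven by the inputs actually seen: a short rejected input touches only a couple of state sets, and later inputs reuse whatever was cached.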

##########
File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene40/blocktree/FieldReader.java
##########
@@ -187,6 +187,14 @@ public TermsEnum intersect(CompiledAutomaton compiled, BytesRef startTerm) throw
     if (compiled.type != CompiledAutomaton.AUTOMATON_TYPE.NORMAL) {
       throw new IllegalArgumentException("please use CompiledAutomaton.getTermsEnum instead");
     }
+    if (compiled.nfaRunAutomaton != null) {
+      return new IntersectTermsEnum(

Review comment:
       Ahh, so it was too difficult to support `nfaRunAutomaton` also in `BlockTree`?  This probably hurts performance quite a bit for `NFAQuery` -- `BlockTree`'s specialized `intersect` impl is fast.  But we can optimize later.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -133,7 +137,35 @@ private static int findSinkState(Automaton automaton) {
    * is one the cases in {@link CompiledAutomaton.AUTOMATON_TYPE}.
    */
   public CompiledAutomaton(Automaton automaton, Boolean finite, boolean simplify) {
-    this(automaton, finite, simplify, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT, false);
+    this(automaton, finite, simplify, ByteRunnable.TYPE.DFA);

Review comment:
       Good -- the existing ctors remain and default to `DFA` strategy.

##########
File path: lucene/codecs/src/java/org/apache/lucene/codecs/memory/DirectPostingsFormat.java
##########
@@ -962,15 +964,22 @@ public ImpactsEnum impacts(int flags) throws IOException {
       private int stateUpto;
 
       public DirectIntersectTermsEnum(CompiledAutomaton compiled, BytesRef startTerm) {
-        runAutomaton = compiled.runAutomaton;
-        compiledAutomaton = compiled;
+        if (compiled.nfaRunAutomaton != null) {
+          this.runAutomaton = compiled.nfaRunAutomaton;

Review comment:
       Maybe instead of having separate `nfaRunAutomaton` and `runAutomaton` we could have only `runAutomaton` and a separate `CompiledAutomaton.isDeterminized` boolean?
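
One possible shape for that suggestion, as a rough sketch only: `RunnableSketch`, `LiteralRunner` and `CompiledSketch` are hypothetical stand-ins (not the PR's `ByteRunnable` or `CompiledAutomaton`), and the point is simply that callers hold one runner field plus a flag instead of branching on which of two fields is non-null.

```java
/** Hypothetical stand-in for a common "runnable automaton" interface. */
interface RunnableSketch {
  int step(int state, int c); // returns -1 when no transition exists

  boolean isAccept(int state);
}

/** Toy runner that accepts exactly one literal string; its state is the match position. */
final class LiteralRunner implements RunnableSketch {
  private final String literal;

  LiteralRunner(String literal) {
    this.literal = literal;
  }

  @Override
  public int step(int state, int c) {
    return state < literal.length() && literal.charAt(state) == c ? state + 1 : -1;
  }

  @Override
  public boolean isAccept(int state) {
    return state == literal.length();
  }
}

/** Hypothetical holder: a single runner field plus a flag recording how it was built. */
final class CompiledSketch {
  final RunnableSketch runAutomaton;
  final boolean isDeterminized;

  CompiledSketch(RunnableSketch runAutomaton, boolean isDeterminized) {
    this.runAutomaton = runAutomaton;
    this.isDeterminized = isDeterminized;
  }

  /** Callers use the same code path whether or not the automaton was determinized. */
  boolean matches(String s) {
    int p = 0;
    for (int i = 0; i < s.length(); i++) {
      p = runAutomaton.step(p, s.charAt(i));
      if (p == -1) {
        return false;
      }
    }
    return runAutomaton.isAccept(p);
  }

  public static void main(String[] args) {
    CompiledSketch c = new CompiledSketch(new LiteralRunner("ab"), true);
    System.out.println(c.matches("ab") + " " + c.matches("abc"));
  }
}
```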

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java
##########
@@ -250,15 +291,23 @@ public CompiledAutomaton(
       }
     }
 
-    // This will determinize the binary automaton for us:
-    runAutomaton = new ByteRunAutomaton(binary, true, determinizeWorkLimit);
+    if (automaton.isDeterministic() == false && byteRunnableType == ByteRunnable.TYPE.NFA) {

Review comment:
       Are we still pulling the common prefix/suffix even in `NFA` mode?  @rmuir recently improved those operations to not require a determinized automaton.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {
+    /** use NFARunAutomaton */

Review comment:
       Can we improve these javadocs?  Instead of referring to internal classes, let's write it as seen from a somewhat less knowledgeable external future user.
   
   E.g. for `DFA`, something like `Fully determinize the automaton up-front for fast term intersection.  Some RegExps may fail to determinize, throwing TooComplexToDeterminizeException.  But if they do not, intersection is fast.`, and for `NFA`, something like `Determinize the automaton lazily on-demand as terms are intersected.  This option saves the up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will be a bit slower`?
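
As an illustration of how the suggested wording could sit on the constants, here is a standalone sketch. `RunnableTypeSketch` is a hypothetical name used so the snippet compiles on its own; in the PR the enum is `ByteRunnable.TYPE`, and the javadoc text below is taken from the suggestion above rather than from the merged code.

```java
/** Hypothetical standalone version of the execution-strategy enum discussed above. */
enum RunnableTypeSketch {
  /**
   * Fully determinize the automaton up-front for fast term intersection. Some RegExps may fail to
   * determinize, throwing TooComplexToDeterminizeException. But if they do not, intersection is
   * fast.
   */
  DFA,

  /**
   * Determinize the automaton lazily on-demand as terms are intersected. This option saves the
   * up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will
   * be a bit slower.
   */
  NFA
}
```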

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+          ;
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(
+          newTextField(
+              FIELD, perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2), Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(directory);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+        ;
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();

Review comment:
       And then make a new (still reused) `HashSet` here, `perQueryVocab`?

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();
+    for (int i = 0; i < docNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(20) + 30;
+      while (perLoopReuse.size() < termNum) {
+        String randomString;
+        while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+          ;
+        perLoopReuse.add(randomString);
+        vocab.add(randomString);
+      }
+      Document document = new Document();
+      document.add(
+          newTextField(
+              FIELD, perLoopReuse.stream().reduce("", (s1, s2) -> s1 + " " + s2), Field.Store.NO));
+      writer.addDocument(document);
+    }
+    writer.commit();
+    IndexReader reader = DirectoryReader.open(directory);
+    IndexSearcher searcher = new IndexSearcher(reader);
+
+    Set<String> foreignVocab = new HashSet<>();
+    while (foreignVocab.size() < vocab.size()) {
+      String randomString;
+      while ((randomString = TestUtil.randomUnicodeString(random())).length() == 0)
+        ;
+      foreignVocab.add(randomString);
+    }
+
+    ArrayList<String> vocabList = new ArrayList<>(vocab);
+    ArrayList<String> foreignVocabList = new ArrayList<>(foreignVocab);
+
+    for (int i = 0; i < automatonNum; i++) {
+      perLoopReuse.clear();
+      int termNum = random().nextInt(40) + 30;
+      while (perLoopReuse.size() < termNum) {
+        if (random().nextBoolean()) {
+          perLoopReuse.add(vocabList.get(random().nextInt(vocabList.size())));
+        } else {
+          perLoopReuse.add(foreignVocabList.get(random().nextInt(foreignVocabList.size())));
+        }
+      }
+      Automaton a = null;
+      for (String term : perLoopReuse) {
+        if (a == null) {
+          a = Automata.makeString(term);
+        } else {
+          a = Operations.union(a, Automata.makeString(term));
+        }
+      }
+      if (a.isDeterministic()) {
+        i--;
+        continue;
+      }
+      AutomatonQuery dfaQuery = new AutomatonQuery(new Term(FIELD), a);
+      AutomatonQuery nfaQuery = new AutomatonQuery(new Term(FIELD), a, ByteRunnable.TYPE.NFA);

Review comment:
       Could you add a new `LuceneTestCase` method, `newAutomatonQuery`, that randomly picks between the `NFA` and `DFA` types?  And then in a few pre-existing tests, let's call `newAutomatonQuery` instead of `new AutomatonQuery`?
   
   That method could also randomly make the automaton non-deterministic by simply cloning a few states?  We can do this in a follow-on issue.

##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/NFARunAutomaton.java
##########
@@ -0,0 +1,429 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.hppc.BitMixer;
+
+/**
+ * A RunAutomaton that does not require DFA, it will determinize and memorize the generated DFA

Review comment:
       Period after `DFA`.
   
   And maybe say `It will lazily determinize on-demand, memorizing the generated DFA states that indexed terms have intersected with`.
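
   To illustrate what "lazily determinize on-demand" means in the suggested wording, here is a minimal, self-contained sketch of lazy subset construction with memoization. It is not the PR's `NFARunAutomaton`: the class `LazyNfaRunner` and its methods are hypothetical names, labels are single `char`s rather than byte ranges, and epsilon transitions are assumed absent. Each DFA state is a set of NFA states, and a DFA transition is computed only the first time an input actually needs it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; this is not the PR's NFARunAutomaton. */
final class LazyNfaRunner {
  // nfaTransitions.get(state).get(label) -> destination NFA states
  private final List<Map<Character, Set<Integer>>> nfaTransitions = new ArrayList<>();
  private final Set<Integer> nfaAccept = new TreeSet<>();

  // Memoized DFA: every distinct set of NFA states gets one DFA state id.
  private final Map<Set<Integer>, Integer> dfaIds = new HashMap<>();
  private final List<Set<Integer>> dfaStates = new ArrayList<>();
  private final List<Map<Character, Integer>> dfaTransitions = new ArrayList<>();

  /** NFA state 0 is the initial state. Add all transitions before calling run. */
  LazyNfaRunner(int numNfaStates) {
    for (int i = 0; i < numNfaStates; i++) {
      nfaTransitions.add(new HashMap<>());
    }
    intern(new TreeSet<>(Collections.singleton(0))); // DFA state 0 = {NFA state 0}
  }

  void addTransition(int from, char label, int to) {
    nfaTransitions.get(from).computeIfAbsent(label, k -> new TreeSet<>()).add(to);
  }

  void setAccept(int state) {
    nfaAccept.add(state);
  }

  /** Returns the DFA state id for this set of NFA states, creating it if needed. */
  private int intern(Set<Integer> nfaStates) {
    Integer id = dfaIds.get(nfaStates);
    if (id == null) {
      id = dfaStates.size();
      dfaIds.put(nfaStates, id);
      dfaStates.add(nfaStates);
      dfaTransitions.add(new HashMap<>());
    }
    return id;
  }

  /** Returns the next DFA state id, or -1 if no NFA state has a matching transition. */
  private int step(int dfaState, char c) {
    Integer cached = dfaTransitions.get(dfaState).get(c);
    if (cached != null) {
      return cached; // this transition was determinized by an earlier input
    }
    Set<Integer> next = new TreeSet<>();
    for (int s : dfaStates.get(dfaState)) {
      Set<Integer> dests = nfaTransitions.get(s).get(c);
      if (dests != null) {
        next.addAll(dests);
      }
    }
    int nextId = next.isEmpty() ? -1 : intern(next);
    dfaTransitions.get(dfaState).put(c, nextId);
    return nextId;
  }

  boolean run(String input) {
    int state = 0;
    for (int i = 0; i < input.length(); i++) {
      state = step(state, input.charAt(i));
      if (state == -1) {
        return false;
      }
    }
    for (int s : dfaStates.get(state)) {
      if (nfaAccept.contains(s)) {
        return true;
      }
    }
    return false;
  }

  /** Number of DFA states materialized so far. */
  int cachedDfaStates() {
    return dfaStates.size();
  }
}
```

   In this sketch only the DFA states that some input has reached are ever built, which is why the up-front determinization cost disappears while the first visit to each transition is slower than a table lookup.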

##########
File path: lucene/core/src/test/org/apache/lucene/util/automaton/TestNFARunAutomaton.java
##########
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util.automaton;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.AutomatonQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IntsRef;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+
+public class TestNFARunAutomaton extends LuceneTestCase {
+
+  private static final String FIELD = "field";
+
+  public void testWithRandomRegex() {
+    for (int i = 0; i < 100; i++) {
+      RegExp regExp = null;
+      while (regExp == null) {
+        try {
+          regExp = new RegExp(AutomatonTestUtil.randomRegexp(random()));
+        } catch (IllegalArgumentException e) {
+          ignoreException(e);
+        }
+      }
+      Automaton nfa = regExp.toNFA();
+      if (nfa.isDeterministic()) {
+        i--;
+        continue;
+      }
+      Automaton dfa = regExp.toAutomaton();
+      NFARunAutomaton candidate = new NFARunAutomaton(nfa);
+      AutomatonTestUtil.RandomAcceptedStrings randomStringGen;
+      try {
+        randomStringGen = new AutomatonTestUtil.RandomAcceptedStrings(dfa);
+      } catch (IllegalArgumentException e) {
+        ignoreException(e);
+        i--;
+        continue; // sometimes the automaton accept nothing and throw this exception
+      }
+
+      for (int round = 0; round < 20; round++) {
+        // test order of accepted strings and random (likely rejected) strings alternatively to make
+        // sure caching system works correctly
+        if (random().nextBoolean()) {
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+          testRandomString(regExp, dfa, candidate, 10);
+        } else {
+          testRandomString(regExp, dfa, candidate, 10);
+          testAcceptedString(regExp, randomStringGen, candidate, 10);
+        }
+      }
+    }
+  }
+
+  public void testWithRandomAutomatonQuery() throws IOException {
+    final int n = 5;
+    for (int i = 0; i < n; i++) {
+      randomAutomatonQueryTest();
+    }
+  }
+
+  private void randomAutomatonQueryTest() throws IOException {
+    final int docNum = 50;
+    final int automatonNum = 50;
+    Directory directory = newDirectory();
+    RandomIndexWriter writer = new RandomIndexWriter(random(), directory);
+
+    Set<String> vocab = new HashSet<>();
+    Set<String> perLoopReuse = new HashSet<>();

Review comment:
       Maybe rename to `perDocVocab`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-982125853


   Thanks @rmuir, I'll run a benchmark to make sure this PR has not recently introduced a regression.
   
   I like the approach you proposed in #485; it would be nice if we could get rid of `determinizeWorkLimit` in some of the classes, since it previously existed everywhere. One reason for carrying an enum and the `determinizeWorkLimit` together is that we might want to use `determinizeWorkLimit` to limit the number of states that the NFA can cache as well. But that feature is not implemented yet, and it could be done in some other way.
   
   I think we can try to get that pushed and then I can rebase this one after.
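
   As a rough illustration of what limiting the number of cached states could look like, here is a minimal sketch of a size-bounded cache. It is only one possible approach and is not part of this PR: the class name `BoundedStateCache` and the `maxEntries` parameter are hypothetical, and it relies solely on the standard `java.util.LinkedHashMap` access-order mode to evict the least recently used entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; the PR does not implement a bounded cache. */
final class BoundedStateCache<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;

  // Assumed limit; a value such as determinizeWorkLimit could be passed in here.
  private final int maxEntries;

  BoundedStateCache(int maxEntries) {
    super(16, 0.75f, true); // accessOrder = true gives least-recently-used ordering
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Called after each insertion; evict once the limit is exceeded.
    return size() > maxEntries;
  }
}
```

   An evicted state would have to be recomputed from the NFA the next time it is reached, so a bound like this trades memory for repeated determinization work.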



[GitHub] [lucene] zhaih commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-998225878


   OK @rmuir some new commits are ready to be reviewed! Please take your time :)



[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r726750180



##########
File path: lucene/core/src/java/org/apache/lucene/search/AutomatonQuery.java
##########
@@ -96,12 +110,36 @@ public AutomatonQuery(final Term term, Automaton automaton, int determinizeWorkL
    */
   public AutomatonQuery(
       final Term term, Automaton automaton, int determinizeWorkLimit, boolean isBinary) {
+    this(term, automaton, determinizeWorkLimit, isBinary, ByteRunnable.TYPE.DFA);
+  }
+
+  /**
+   * Create a new AutomatonQuery from an {@link Automaton}.
+   *
+   * @param term Term containing field and possibly some pattern structure. The term text is
+   *     ignored.
+   * @param automaton Automaton to run, terms that are accepted are considered a match.
+   * @param determinizeWorkLimit maximum effort to spend determinizing the automaton. If the
+   *     automaton will need more than this much effort, TooComplexToDeterminizeException is thrown.
+   *     Higher numbers require more space but can process more complex automata.
+   * @param isBinary if true, this automaton is already binary and will not go through the
+   *     UTF32ToUTF8 conversion
+   * @param runnableType NFA or DFA. See {@link org.apache.lucene.util.automaton.ByteRunnable.TYPE}
+   *     for difference between NFA and DFA. Also note * that NFA has uncertain performance impact

Review comment:
       Ah good catch! I guess it happened when I was moving the last line up.





[GitHub] [lucene] zhaih commented on a change in pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

zhaih commented on a change in pull request #225:
URL: https://github.com/apache/lucene/pull/225#discussion_r726749051



##########
File path: lucene/core/src/java/org/apache/lucene/util/automaton/ByteRunnable.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.util.automaton;
+
+/** A runnable automaton accepting byte array as input */
+public interface ByteRunnable {
+
+  /** NFA or DFA */
+  enum TYPE {
+    /**
+     * Determinize the automaton lazily on-demand as terms are intersected. This option saves the
+     * up-front determinize cost, and can handle some RegExps that DFA cannot, but intersection will
+     * be a bit slower

Review comment:
       Oh this page is linked in NFARunAutomaton's javadoc already :)





[GitHub] [lucene] dweiss commented on pull request #225: LUCENE-10010 Introduce NFARunAutomaton to run NFA directly

Posted by GitBox <gi...@apache.org>.

dweiss commented on pull request #225:
URL: https://github.com/apache/lucene/pull/225#issuecomment-941302842


   The stack trace is interesting though - looks like a double close:
   ```
   org.apache.lucene.index.TestIndexFileDeleter > testExcInDecRef FAILED
       org.apache.lucene.store.AlreadyClosedException: ReaderPool is already closed
           at __randomizedtesting.SeedInfo.seed([FD14DA9475FFAE2C:1489ADA6033649D1]:0)
           at app//org.apache.lucene.index.ReaderPool.get(ReaderPool.java:400)
           at app//org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3742)
           at app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:590)
           at app//org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:474)
           at app//org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:406)
           at app//org.apache.lucene.index.TestIndexFileDeleter.testExcInDecRef(TestIndexFileDeleter.java:484)
   ```

