
Posted to issues@lucene.apache.org by "gsmiller (via GitHub)" <gi...@apache.org> on 2023/05/21 02:52:21 UTC

[GitHub] [lucene] gsmiller opened a new pull request, #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

gsmiller opened a new pull request, #12320:
URL: https://github.com/apache/lucene/pull/12320

   ### Description
   
   Adds the ability to directly build a binary automaton for a string union using the Daciuk-Mihov algorithm, and uses it to make the `TermInSetQuery#visit` implementation a little more efficient. I'm hoping we end up moving to an automaton approach in general for `TermInSetQuery` (see #12312), but I think this is a good iterative step for now, as suggested by @rmuir / @mikemccand in #12310.
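
To make the "binary" versus code point distinction concrete, here is a minimal sketch using only the JDK. It assumes that "binary" means transitions labeled with individual UTF-8 bytes, as the review discussion in this thread suggests. The class and method names are hypothetical and are not part of Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not a Lucene class.
final class TransitionLabelsSketch {

  /** One label per UTF-8 byte (0..255): the labels a byte-based automaton would use. */
  static int[] byteLabels(String term) {
    byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
    int[] labels = new int[utf8.length];
    for (int i = 0; i < utf8.length; i++) {
      labels[i] = utf8[i] & 0xFF; // treat each byte as unsigned
    }
    return labels;
  }

  /** One label per Unicode code point: the labels a character-based automaton would use. */
  static int[] codePointLabels(String term) {
    return term.codePoints().toArray();
  }

  public static void main(String[] args) {
    String term = "\u00e9"; // U+00E9, encoded in UTF-8 as the two bytes C3 A9
    System.out.println(Arrays.toString(byteLabels(term)));      // [195, 169]
    System.out.println(Arrays.toString(codePointLabels(term))); // [233]
  }
}
```

The same term therefore produces a two-transition path in the byte-labeled form and a single transition in the code-point-labeled form.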


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] mikemccand commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564480240

   > Resolving the class naming conflicts from `main` was a bit of a hassle with an incremental git history.
   
   Woops, sorry!



[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205834475


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);

Review Comment:
   That's correct.




[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1573976414

   Thanks @mikemccand! Appreciate you making time for this! 🎉 



[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205840583


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);
+        fromIndex += Character.charCount(cp);
+      }
+      state.is_final = true;

Review Comment:
   Oh, I like that idea!




[GitHub] [lucene] gsmiller merged pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller merged PR #12320:
URL: https://github.com/apache/lucene/pull/12320



[GitHub] [lucene] jpountz commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "jpountz (via GitHub)" <gi...@apache.org>.

jpountz commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1644230482

   We have had 3 failures of `TestStringsToAutomaton` on Policeman/Apache Jenkins since this change was merged that we were not getting before, so I wonder if it's related. I opened https://github.com/apache/lucene/issues/12451 earlier today that has a reproducible seed.



[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206096086


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);

Review Comment:
   I looked briefly at what it would take to build directly and not convert at the end, and I think it's better tackled as a follow-up. It will require a little bit of work to handle the "minimization as we go" bit without our own intermediate state representation. I'll open a spin-off issue.




[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205833951


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);

Review Comment:
   This is minimizing through the `replaceOrRegister` method that gets called when "moving on" to a new suffix. Once the common prefix has been found, we can minimize its most recently added transition since it's now immutable (thanks to adding terms in sorted order).
   
   Also, +1 to the idea of building to an `Automaton` directly instead of going through `convert` at the end.
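
The "minimize as we go" behavior described above can be sketched as follows. This is a simplified, self-contained illustration, not Lucene's implementation: the class name, the string-signature registry, and the `finish`/`accepts` helpers are all assumptions made for the example. It assumes terms arrive in sorted order (for the ASCII terms used here, plain `String` order is enough).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of incremental minimization over sorted input; not Lucene code.
final class SortedMinimalAutomatonSketch {
  static final class State {
    final int id;
    final TreeMap<Integer, State> children = new TreeMap<>();
    boolean isFinal;
    State(int id) { this.id = id; }
  }

  private int nextId = 0;
  private final State root = new State(nextId++);
  // Canonical (already minimized) states, keyed by finality plus outgoing transitions.
  private final Map<String, State> registry = new HashMap<>();

  /** Adds one term. Terms must be added in sorted order. */
  void add(String term) {
    int[] labels = term.codePoints().toArray();
    State state = root;
    int pos = 0;
    // Descend along the prefix shared with the previously added term.
    while (pos < labels.length) {
      Map.Entry<Integer, State> last = state.children.lastEntry();
      if (last == null || last.getKey() != labels[pos]) break;
      state = last.getValue();
      pos++;
    }
    // The previous term's remaining path can no longer change, so minimize it now.
    if (!state.children.isEmpty()) replaceOrRegister(state);
    // Append a fresh path for the new suffix.
    for (; pos < labels.length; pos++) {
      State next = new State(nextId++);
      state.children.put(labels[pos], next);
      state = next;
    }
    state.isFinal = true;
  }

  /** Replaces the most recently added child with an equivalent registered state, if any. */
  private void replaceOrRegister(State state) {
    Map.Entry<Integer, State> last = state.children.lastEntry();
    State child = last.getValue();
    if (!child.children.isEmpty()) replaceOrRegister(child); // bottom-up
    State canonical = registry.putIfAbsent(signature(child), child);
    if (canonical != null) state.children.put(last.getKey(), canonical);
  }

  private static String signature(State s) {
    StringBuilder sb = new StringBuilder(s.isFinal ? "F" : "N");
    for (Map.Entry<Integer, State> e : s.children.entrySet()) {
      sb.append(',').append(e.getKey()).append('>').append(e.getValue().id);
    }
    return sb.toString();
  }

  /** Minimizes the last term's path and returns the number of reachable states. */
  int finish() {
    if (!root.children.isEmpty()) replaceOrRegister(root);
    return count(root, new HashSet<>());
  }

  private static int count(State s, Set<State> seen) {
    if (!seen.add(s)) return 0;
    int n = 1;
    for (State c : s.children.values()) n += count(c, seen);
    return n;
  }

  boolean accepts(String term) {
    State s = root;
    for (int cp : term.codePoints().toArray()) {
      s = s.children.get(cp);
      if (s == null) return false;
    }
    return s.isFinal;
  }

  public static void main(String[] args) {
    SortedMinimalAutomatonSketch b = new SortedMinimalAutomatonSketch();
    for (String t : new String[] {"bar", "baz", "car", "caz"}) b.add(t);
    System.out.println(b.finish()); // 4 states, versus 9 nodes in a plain trie
  }
}
```

In this sketch, suffix sharing happens inside `replaceOrRegister`: each state on the previous term's path is compared against the registry only after its own children have been made canonical, so two states with the same signature accept the same suffixes.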




[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206158978


##########
lucene/CHANGES.txt:
##########
@@ -139,6 +139,9 @@ Improvements
 
 * GITHUB#12305: Minor cleanup and improvements to DaciukMihovAutomatonBuilder. (Greg Miller)
 
+* GITHUB#12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit.

Review Comment:
   Kept the old name here since I'm proposing this change for 9.x (and the rename will come in 10)




[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205600165


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {

Review Comment:
   `final` too?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);

Review Comment:
   Hmm, I wonder how this is creating a minimal Automaton?  It seems to create a new path for every suffix without sharing the common suffixes?
   
   (This is not a problem with this PR but rather a pre-existing issue and likely my not understanding this algorithm!).
   
   Actually, I think this is nearly the same algorithm as the FST Builder, just applied to automaton (no outputs) instead of FST.
   
   Edit: maybe minimizing the "tail" of the automaton happens in `convert`?
   
   Edit 2: actually, I think we could maybe further optimize this builder to directly build `Automaton` instead of first creating its intermediate (and more RAM consuming?) automaton representation.  Future work :)



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);
+        fromIndex += Character.charCount(cp);
+      }
+      state.is_final = true;

Review Comment:
   Do we have a unit test for this class that generates random strings in a smallish alphabet, uses this builder to create the minimal automaton, builds an inefficient automaton with the existing union methods and minimizes it at the end, and then asserts that the two ways of creating the minimal automaton (simple yet slow, complex but fast) produce identical (isomorphic) automata?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);

Review Comment:
   It looks like we were already doing this conversion previously?  So this change is not adding more cost in the `CharsRef` case?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   We could also decode the next Unicode code point directly from the UTF-8 bytes, instead of converting up front to a `CharsRef`?  Or maybe just convert to `int[]` (`UnicodeUtil.UTF8toUTF32`)?
   
   If we did the former (decode directly from `BytesRef`) we could perhaps not even make subclasses here and just have a small `if` in each of the add/addSuffix methods to pull the "next int" (either a UTF-8 unit or Unicode code point) on each loop.




[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1564595369

   Thanks @mikemccand! Did a pass to address your comments. Much appreciated! I also added some testing around the minimization aspect of the automaton building. I think all feedback has been addressed at this point, but no rush on having another look. Thanks again!



[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1644622398

   Thanks @jpountz. I'll have a look soon.



[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205661713


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   I don't think Lucene has a `BytesRef` (UTF8) equivalent of `Character.codePointAt` and `Character.charCount` (byteCount), but it's quite trivial to implement ... UTF-8 makes this easy by just looking at the top (sign) bit of each byte to see if the character "continues", I think.
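
A minimal sketch of such helpers, using only the JDK, might look like the following. The class and method names are hypothetical (they are not Lucene APIs), and the code assumes well-formed UTF-8. One detail worth spelling out: the top bit alone separates ASCII from multi-byte sequences, continuation bytes have the form `10xxxxxx`, and the high bits of the lead byte give the sequence length.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helpers; assumes the input is well-formed UTF-8.
final class Utf8CodePoints {

  /** Length in bytes of the UTF-8 sequence that starts with the given lead byte. */
  static int byteCount(byte lead) {
    int v = lead & 0xFF;
    if (v < 0x80) return 1;           // 0xxxxxxx: ASCII
    if ((v & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((v & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((v & 0xF8) == 0xF0) return 4; // 11110xxx
    throw new IllegalArgumentException("not a UTF-8 lead byte: 0x" + Integer.toHexString(v));
  }

  /** Decodes the code point whose encoding starts at {@code pos}. */
  static int codePointAt(byte[] utf8, int pos) {
    int len = byteCount(utf8[pos]);
    int v = utf8[pos] & 0xFF;
    // Keep only the payload bits of the lead byte (5, 4, or 3 bits for 2-, 3-, 4-byte forms).
    int cp = (len == 1) ? v : (v & (0x7F >> len));
    for (int i = 1; i < len; i++) {
      cp = (cp << 6) | (utf8[pos + i] & 0x3F); // each continuation byte adds 6 payload bits
    }
    return cp;
  }

  /** Decodes a whole byte sequence into code points, advancing by byteCount each step. */
  static int[] decodeAll(byte[] utf8) {
    int[] out = new int[utf8.length];
    int n = 0;
    for (int pos = 0; pos < utf8.length; pos += byteCount(utf8[pos])) {
      out[n++] = codePointAt(utf8, pos);
    }
    return Arrays.copyOf(out, n);
  }

  public static void main(String[] args) {
    // 1-, 2-, 3-, and 4-byte sequences: 'a', U+00E9, U+20AC, U+1F600
    String s = "a\u00e9\u20ac\uD83D\uDE00";
    int[] decoded = decodeAll(s.getBytes(StandardCharsets.UTF_8));
    System.out.println(Arrays.equals(decoded, s.codePoints().toArray())); // true
  }
}
```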




[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206103140


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {

Review Comment:
   Removed these classes since I was able to make some simplifications based on your other feedback. Thanks!




[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.

mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206892745


##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is

Review Comment:
   I think we may also throw `ArrayIndexOutOfBoundsException` on a `byte[]` that is badly malformed (not UTF-8 at all)?  The `utf8CodeLength` array is, I think, length 248 (256 - 8).  It also has a bunch of `v` entries in it, which I think mark invalid UTF-8 first bytes and should result in the `IllegalArgumentException`.
   
   Maybe either catch the AIOOBE and rethrow it as IAE, or soften the statement to say "throws various exceptions on invalid UTF-8, or if the provided pos is NOT the start of a Unicode character".  I don't think we want to promise that we will always detect invalid UTF-8 and throw a clean exception.
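
A rough sketch of the "catch the AIOOBE and rethrow as IAE" option. The 248-entry table below is rebuilt from the description in this comment rather than copied from `UnicodeUtil`, and `LenientDecodeSketch`, `CODE_LENGTH`, `INVALID` and `lengthAt` are hypothetical names used only for illustration.

```java
import java.util.Arrays;

/** Hypothetical illustration; the table layout is an assumption based on the comment above. */
final class LenientDecodeSketch {
  static final int INVALID = Integer.MIN_VALUE;

  // 248 entries: lead bytes 0xF8..0xFF have no slot, so looking them up overflows the table.
  static final int[] CODE_LENGTH = new int[248];

  static {
    Arrays.fill(CODE_LENGTH, 0, 128, 1); // 0xxxxxxx
    Arrays.fill(CODE_LENGTH, 128, 192, INVALID); // 10xxxxxx: continuation bytes
    Arrays.fill(CODE_LENGTH, 192, 224, 2); // 110xxxxx
    Arrays.fill(CODE_LENGTH, 224, 240, 3); // 1110xxxx
    Arrays.fill(CODE_LENGTH, 240, 248, 4); // 11110xxx
  }

  /** Returns the sequence length at {@code pos}, reporting every failure as IAE. */
  static int lengthAt(byte[] utf8, int pos) {
    try {
      int n = CODE_LENGTH[utf8[pos] & 0xFF];
      if (n == INVALID) {
        throw new IllegalArgumentException("invalid utf8 lead byte at pos=" + pos);
      }
      return n;
    } catch (ArrayIndexOutOfBoundsException e) {
      // Covers both a lead byte >= 0xF8 and a pos outside the input array.
      throw new IllegalArgumentException("invalid utf8 at pos=" + pos, e);
    }
  }
}
```

The alternative discussed here, softening the javadoc instead, needs no code change and avoids a try/catch on what is presumably a hot path.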



##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
+   *     prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) {
+    if (reuse == null) {
+      reuse = new UTF8CodePoint();
+    }
+
+    int leadByte = utf8[pos] & 0xFF;
+    int numBytes = utf8CodeLength[leadByte];
+    reuse.codePointBytes = numBytes;
+    int v;
+    switch (numBytes) {
+      case 1 -> {
+        reuse.codePoint = leadByte;
+        return reuse;
       }
-      ints[utf32Count++] = v;
+      case 2 -> v = leadByte & 31; // 5 useful bits
+      case 3 -> v = leadByte & 15; // 4 useful bits
+      case 4 -> v = leadByte & 7; // 3 useful bits
+      default -> throw new IllegalArgumentException("invalid utf8");

Review Comment:
   Maybe include `Arrays.toString(utf8)` and `pos` in the exception message?  Or perhaps just the fragment where the malformed UTF-8 started (`utf8[pos:]` in Python syntax).
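
One way the fragment idea could look, as a sketch only. The cap on the fragment length (`MAX_FRAGMENT`) is an assumption added here to keep the message bounded, and `ErrorMessageSketch` and `describe` are hypothetical names. `pos` is assumed to be a valid index into the array.

```java
import java.util.Arrays;

/** Hypothetical illustration of an exception message that shows where decoding failed. */
final class ErrorMessageSketch {
  static final int MAX_FRAGMENT = 8; // arbitrary cap chosen for this sketch

  /** Builds a message containing {@code pos} and the bytes from {@code pos} onward, capped. */
  static String describe(byte[] utf8, int pos) {
    int end = Math.min(utf8.length, pos + MAX_FRAGMENT);
    byte[] fragment = Arrays.copyOfRange(utf8, pos, end); // like utf8[pos:end] in Python
    String suffix = end < utf8.length ? " (truncated)" : "";
    return "invalid utf8 at pos=" + pos + ", bytes=" + Arrays.toString(fragment) + suffix;
  }
}
```

A caller would pass the result to the exception constructor, for example `throw new IllegalArgumentException(ErrorMessageSketch.describe(utf8, pos))`.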



##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
+   *     prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) {
+    if (reuse == null) {
+      reuse = new UTF8CodePoint();
+    }
+
+    int leadByte = utf8[pos] & 0xFF;
+    int numBytes = utf8CodeLength[leadByte];
+    reuse.codePointBytes = numBytes;
+    int v;
+    switch (numBytes) {
+      case 1 -> {
+        reuse.codePoint = leadByte;
+        return reuse;
       }
-      ints[utf32Count++] = v;
+      case 2 -> v = leadByte & 31; // 5 useful bits
+      case 3 -> v = leadByte & 15; // 4 useful bits
+      case 4 -> v = leadByte & 7; // 3 useful bits
+      default -> throw new IllegalArgumentException("invalid utf8");
     }
 
-    return utf32Count;
+    // TODO: this may read past utf8's limit.

Review Comment:
   Ahh yes, another `AIOOBE` case.  I think it's fine if we throw whatever exception happens to surface when you pass invalid UTF-8.
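
To make the "read past the limit" case concrete, here is a small sketch that mirrors the continuation-byte loop from the diff and adds an optional bounds check. `TruncationSketch`, `decodeUnchecked`, `decodeChecked` and the `end` parameter are hypothetical and only cover multi-byte sequences (`numBytes` of 2 to 4).

```java
/** Hypothetical illustration of what happens when a multi-byte sequence is cut short. */
final class TruncationSketch {

  /** Same shape as the loop in the diff: no check that the continuation bytes exist. */
  static int decodeUnchecked(byte[] utf8, int pos, int numBytes) {
    // Payload bits of the lead byte; this mask is only correct for numBytes in 2..4.
    int v = (utf8[pos] & 0xFF) & (0xFF >> (numBytes + 1));
    final int limit = pos + numBytes;
    pos++;
    while (pos < limit) {
      v = v << 6 | utf8[pos++] & 63;
    }
    return v;
  }

  /** Variant that refuses to read at or beyond {@code end} (exclusive). */
  static int decodeChecked(byte[] utf8, int pos, int numBytes, int end) {
    if (pos + numBytes > end) {
      throw new IllegalArgumentException("truncated utf8 at pos=" + pos);
    }
    return decodeUnchecked(utf8, pos, numBytes);
  }
}
```

Note that the unchecked version only throws `ArrayIndexOutOfBoundsException` when the truncation coincides with the end of the backing array. If the array holds more bytes past the logical limit (as a shared `BytesRef` buffer can), the loop reads them silently and returns a wrong code point.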



##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
+   *     prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) {
+    if (reuse == null) {
+      reuse = new UTF8CodePoint();
+    }
+
+    int leadByte = utf8[pos] & 0xFF;
+    int numBytes = utf8CodeLength[leadByte];
+    reuse.codePointBytes = numBytes;
+    int v;
+    switch (numBytes) {
+      case 1 -> {
+        reuse.codePoint = leadByte;
+        return reuse;
       }
-      ints[utf32Count++] = v;
+      case 2 -> v = leadByte & 31; // 5 useful bits
+      case 3 -> v = leadByte & 15; // 4 useful bits
+      case 4 -> v = leadByte & 7; // 3 useful bits
+      default -> throw new IllegalArgumentException("invalid utf8");
     }
 
-    return utf32Count;
+    // TODO: this may read past utf8's limit.
+    final int limit = pos + numBytes;
+    pos++;
+    while (pos < limit) {
+      v = v << 6 | utf8[pos++] & 63;
+    }
+    reuse.codePoint = v;
+
+    return reuse;
+  }
+
+  /** Holds a codepoint along with the number of bytes required to represent it in UTF8 */
+  public static final class UTF8CodePoint {
+    public int codePoint;
+    public int codePointBytes;

Review Comment:
   Maybe rename to `numBytes`?  The `codePoint` prefix seems redundant.




[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205841503


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   Hmm... yeah good idea. I'll explore this a bit since it could make the implementation simpler. Thanks for the idea!




[GitHub] [lucene] gsmiller commented on pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on PR #12320:
URL: https://github.com/apache/lucene/pull/12320#issuecomment-1563719902

   Updated based on the prior feedback, except for one outstanding testing suggestion, which I'll have a look at soon. I think the builder logic is now much cleaner across the string and binary automaton cases.



[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206972753


##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is
+   *     prematurely truncated.
+   */
+  public static UTF8CodePoint codePointAt(byte[] utf8, int pos, UTF8CodePoint reuse) {
+    if (reuse == null) {
+      reuse = new UTF8CodePoint();
+    }
+
+    int leadByte = utf8[pos] & 0xFF;
+    int numBytes = utf8CodeLength[leadByte];
+    reuse.codePointBytes = numBytes;
+    int v;
+    switch (numBytes) {
+      case 1 -> {
+        reuse.codePoint = leadByte;
+        return reuse;
       }
-      ints[utf32Count++] = v;
+      case 2 -> v = leadByte & 31; // 5 useful bits
+      case 3 -> v = leadByte & 15; // 4 useful bits
+      case 4 -> v = leadByte & 7; // 3 useful bits
+      default -> throw new IllegalArgumentException("invalid utf8");

Review Comment:
   How about including just the header byte that resulted in the illegal parse? I'm a little nervous about including the whole substring of bytes, since it has unbounded length and could get a bit unwieldy.
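
A sketch of the header-byte-only message, which stays a fixed size regardless of input length. The exact wording and the names `HeaderByteMessageSketch` and `describe` are hypothetical.

```java
import java.util.Locale;

/** Hypothetical illustration of a fixed-size message naming only the offending lead byte. */
final class HeaderByteMessageSketch {

  /** Formats the lead byte at {@code pos} as two hex digits, together with the position. */
  static String describe(byte[] utf8, int pos) {
    return String.format(
        Locale.ROOT, "invalid utf8: lead byte 0x%02X at pos=%d", utf8[pos] & 0xFF, pos);
  }
}
```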




[GitHub] [lucene] gsmiller commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.

gsmiller commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1206966623


##########
lucene/core/src/java/org/apache/lucene/util/UnicodeUtil.java:
##########
@@ -477,38 +477,60 @@ public static int UTF8toUTF32(final BytesRef utf8, final int[] ints) {
     int utf8Upto = utf8.offset;
     final byte[] bytes = utf8.bytes;
     final int utf8Limit = utf8.offset + utf8.length;
+    UTF8CodePoint reuse = null;
     while (utf8Upto < utf8Limit) {
-      final int numBytes = utf8CodeLength[bytes[utf8Upto] & 0xFF];
-      int v = 0;
-      switch (numBytes) {
-        case 1:
-          ints[utf32Count++] = bytes[utf8Upto++];
-          continue;
-        case 2:
-          // 5 useful bits
-          v = bytes[utf8Upto++] & 31;
-          break;
-        case 3:
-          // 4 useful bits
-          v = bytes[utf8Upto++] & 15;
-          break;
-        case 4:
-          // 3 useful bits
-          v = bytes[utf8Upto++] & 7;
-          break;
-        default:
-          throw new IllegalArgumentException("invalid utf8");
-      }
+      reuse = codePointAt(bytes, utf8Upto, reuse);
+      ints[utf32Count++] = reuse.codePoint;
+      utf8Upto += reuse.codePointBytes;
+    }
 
-      // TODO: this may read past utf8's limit.
-      final int limit = utf8Upto + numBytes - 1;
-      while (utf8Upto < limit) {
-        v = v << 6 | bytes[utf8Upto++] & 63;
+    return utf32Count;
+  }
+
+  /**
+   * Computes the codepoint and codepoint length (in bytes) of the specified {@code offset} in the
+   * provided {@code utf8} byte array, assuming UTF8 encoding. As with other related methods in this
+   * class, this assumes valid UTF8 input and <strong>does not perform</strong> full UTF8
+   * validation.
+   *
+   * @throws IllegalArgumentException If invalid codepoint header byte occurs or the content is

Review Comment:
   You're correct that it could throw AIOOBE on a particularly malformed header byte. I think the `v` business is OK since the default switch case translates that to IAE, but I agree with your suggestion to make a more general statement that this method may do all sorts of terrible and unexpected things if you feed it invalid UTF-8 (or reference an invalid start position).


