You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "mikemccand (via GitHub)" <gi...@apache.org> on 2023/05/25 15:06:59 UTC

[GitHub] [lucene] mikemccand commented on a diff in pull request #12320: Add "direct to binary" option for DaciukMihovAutomatonBuilder and use it in TermInSetQuery#visit

mikemccand commented on code in PR #12320:
URL: https://github.com/apache/lucene/pull/12320#discussion_r1205600165


##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {

Review Comment:
   `final` too?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);

Review Comment:
   Hmm, I wonder how this is creating a minimal Automaton?  It seems to create a new path for every suffix without sharing the common suffixes?
   
   (This is not a problem with this PR but rather a pre-existing issue and likely my not understanding this algorithm!).
   
   Actually, I think this is nearly the same algorithm as the FST Builder, just applied to automaton (no outputs) instead of FST.
   
   Edit: maybe minimizing the "tail" of the automaton happens in `convert`?
   
   Edit 2: actually, I think we could maybe further optimize this builder to directly build `Automaton` instead of first creating its intermediate (and more RAM consuming?) automaton representation.  Future work :)



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);
+        State next = state.lastChild(codePoint);
+        if (next == null) {
+          break;
+        }
+
+        state = next;
+        pos += Character.charCount(codePoint);
+      }
+
+      if (state.hasChildren()) replaceOrRegister(state);
+
+      addSuffix(state, currentChars, pos);
+    }
+
+    /**
+     * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
+     * <code>state</code>.
+     */
+    private static void addSuffix(State state, CharSequence current, int fromIndex) {
+      final int len = current.length();
+      while (fromIndex < len) {
+        int cp = Character.codePointAt(current, fromIndex);
+        state = state.newState(cp);
+        fromIndex += Character.charCount(cp);
+      }
+      state.is_final = true;

Review Comment:
   Do we have a unit test for this class that generates random strings in a smallish alphabet, uses this builder to create the minimal automaton, and then builds an inefficient automaton with the existing union methods, then minimizing in the end, then asserting that the two ways for creating the minimal automaton (simple yet slow, complex but fast) produce identical (isomorphic) automaton?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);

Review Comment:
   It looks like we were already doing this conversion previously?  So this change is not adding more cost in the `CharsRef` case?



##########
lucene/core/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java:
##########
@@ -308,17 +316,84 @@ private void replaceOrRegister(State state) {
     }
   }
 
-  /**
-   * Add a suffix of <code>current</code> starting at <code>fromIndex</code> (inclusive) to state
-   * <code>state</code>.
-   */
-  private void addSuffix(State state, CharSequence current, int fromIndex) {
-    final int len = current.length();
-    while (fromIndex < len) {
-      int cp = Character.codePointAt(current, fromIndex);
-      state = state.newState(cp);
-      fromIndex += Character.charCount(cp);
+  private static class CharacterBasedBuilder extends DaciukMihovAutomatonBuilder {
+    private final CharsRefBuilder scratch = new CharsRefBuilder();
+
+    @Override
+    protected void doAdd(BytesRef current) {
+      // Convert the input UTF-8 bytes to CharsRef so we can use the code points as our transition
+      // labels.
+      scratch.copyUTF8Bytes(current);
+      CharsRef currentChars = scratch.get();
+
+      // Descend in the automaton (find matching prefix).
+      int pos = 0, max = currentChars.length();
+      State state = root;
+      for (; ; ) {
+        assert pos <= max;
+        if (pos == max) {
+          break;
+        }
+
+        int codePoint = Character.codePointAt(currentChars, pos);

Review Comment:
   We could also decode the next Unicode code point directly from the UTF-8 bytes, instead of converting up front to a `CharsRef`?  Or maybe just convert to `int[]` (`UnicodeUtil.UTF8toUTF32`)?
   
   If we did the former (decode directly from `BytesRef`) we could perhaps not even make subclasses here and just have a small `if` in each of the add/addSuffix methods to pull the "next int" (either a UTF-8 unit or Unicode code point) on each loop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org