You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "mikemccand (via GitHub)" <gi...@apache.org> on 2023/05/18 12:03:58 UTC

[GitHub] [lucene] mikemccand opened a new pull request, #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

mikemccand opened a new pull request, #12310:
URL: https://github.com/apache/lucene/pull/12310

   This is just a rote rename of this helpful class.  I plan 10.0 only since it is technically a public API break.  I added a line to MIGRATE.txt too.
   
   Closes #12276
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555415525

   > Yeah we should explore a binary version. Even if it doesn't speedup TermInSetQuery.
   
   +1
   
   > Pretty sure I added a comment along the lines of "we should not do this"
   
   Oh yeah!  I think you are referring to this super scary code!
   
   ```
     @Override
     public void visit(QueryVisitor visitor) {
       if (visitor.acceptField(field) == false) {
         return;
       }
       if (termData.size() == 1) {
         visitor.consumeTerms(this, new Term(field, termData.iterator().next()));
       }
       if (termData.size() > 1) {
         visitor.consumeTermsMatching(this, field, this::asByteRunAutomaton);
       }
     }
   
     // TODO: this is extremely slow. we should not be doing this.
     private ByteRunAutomaton asByteRunAutomaton() {
       TermIterator iterator = termData.iterator();
       List<Automaton> automata = new ArrayList<>();
       for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
         automata.add(Automata.makeBinary(term));
       }
       Automaton automaton =
           Operations.determinize(
               Operations.union(automata), Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
       return new CompiledAutomaton(automaton).runAutomaton;
     }
   ```
   
   We should indeed make a `BytesRefsToAutomaton`!  I'll open a spinoff.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555414840

   > LGTM. Apologies for the merge conflict I created for you (but thanks for the review on that PR!).
   
   No worries!  I'll resolve and merge soon!  Thanks for the review @gsmiller.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand merged pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand merged PR #12310:
URL: https://github.com/apache/lucene/pull/12310


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] rmuir commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "rmuir (via GitHub)" <gi...@apache.org>.
rmuir commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1553633433

   Yeah we should explore a binary version. Even if it doesn't speedup TermInSetQuery.
   
   TermInSetQuery has/had a super trappy visitor method that builds an automaton from it's sorted terms in a very slow way for some visitor method. Pretty sure I added a comment along the lines of "we should not do this". We could at least fix that trap. Check my assumptions and memory as I am on a phone


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1556069606

   @mikemccand the `build` method you reference above in `DaciukMihovAutomatonBuilder` build an automaton with code points as transition labels, but I think we need a "compiled" binary automaton for this visitor? We can have `CompiledAutomaton` do this conversion, but it's pretty wasteful to build one way then convert it when we could build directly to binary? I opened a separate PR that can do a direct binary build over in #12320. I might be completely overlooking a simpler way to do this with code we already have though, so I'm happy to close that PR out in favor of a simpler approach if one exists. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555927824

   @mikemccand where's the build method you're referencing? I took a pass at creating a "direct to binary" version of the Daciuk-Mihov algorithm in #12312. We could fold that into this work if we want? Or treat it as a separate follow-up task. But maybe there's already some direct binary building logic I wasn't aware of and my changes aren't needed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] mikemccand commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "mikemccand (via GitHub)" <gi...@apache.org>.
mikemccand commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1555416664

   Actually, the terms must be sorted in Unicode code point order, and, we do have the builder for `BytesRef` already: `public static Automaton build(Collection<BytesRef> input) {`.  So I think we should just fix `TermInSetQuery`'s scary code to use this binary builder?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org


[GitHub] [lucene] gsmiller commented on pull request #12310: #12276: rename DaciukMihovAutomatonBuilder to StringsToAutomaton

Posted by "gsmiller (via GitHub)" <gi...@apache.org>.
gsmiller commented on PR #12310:
URL: https://github.com/apache/lucene/pull/12310#issuecomment-1556069972

   Separately, in terms of renaming this class, I'm still in favor of it but I also wonder if we should just consider making it pkg-private and proxying the build functionality through `Automata#makeStringUnion`? That's a pretty clean entry-point for users, and I'm not sure there's any value in having this class be public? I created a spin-off issue #12321


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org