Posted to dev@lucene.apache.org by dsmiley <gi...@git.apache.org> on 2018/05/25 21:58:36 UTC

[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

GitHub user dsmiley opened a pull request:

    https://github.com/apache/lucene-solr/pull/384

    LUCENE-8332 move CompletionTokenStream to ConcatenateGraphFilter

    See JIRA issue for overview comments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dsmiley/lucene-solr LUCENE-8332

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/lucene-solr/pull/384.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #384
    
----
commit d60ba6a1687566efdcee1f29f1666f4ec9d4fba5
Author: David Smiley <ds...@...>
Date:   2018-05-25T21:53:02Z

    initial move of CompletionTokenStream to ConcatenateGraphFilter

----


---



[GitHub] lucene-solr issue #384: LUCENE-8332 move CompletionTokenStream to Concatenat...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on the issue:

    https://github.com/apache/lucene-solr/pull/384
  
    > If we really want to insist that every component be cranky when mistreated, then perhaps we could have toAutomaton not close the inputTokenStream so that we can delegate that from CGF's close?
    
    I think it's the right thing to do, and the change is minimal. Good catch!


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191571833
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    --- End diff --
    
    Perhaps it _never_ matters; yet TokenFilter exists nonetheless.  TokenFilter is more consistent with public/visible analysis components, and it makes it easier to pass through the calls to the delegate/input.
    
    Let me put it another way... shouldn't FingerprintFilter & ConcatenateGraphFilter do the same thing, whatever that is?  FingerprintFilter is already public/known, so I'm trying to introduce this in a way that is consistent with it.  The fact that CompletionTokenStream, which is kinda hidden, is currently not a TokenFilter isn't relevant.
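    
    (For reference, the pass-through in question is roughly what TokenFilter already provides out of the box; this is a simplified sketch of Lucene's TokenFilter, not new code in this PR:)
    ```
    public abstract class TokenFilter extends TokenStream {
      protected final TokenStream input;
    
      protected TokenFilter(TokenStream input) {
        super(input);           // share the input's attribute source
        this.input = input;
      }
    
      // lifecycle calls are delegated for you
      @Override public void end() throws IOException { input.end(); }
      @Override public void reset() throws IOException { input.reset(); }
      @Override public void close() throws IOException { input.close(); }
    }
    ```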


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191569855
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -119,27 +147,19 @@ public boolean incrementToken() throws IOException {
     
       @Override
       public void end() throws IOException {
    -    super.end();
    -    if (finiteStrings == null) {
    -      inputTokenStream.end();
    -    }
    -  }
    -
    -  @Override
    -  public void close() throws IOException {
    -    if (finiteStrings == null) {
    -      inputTokenStream.close();
    -    }
    +    restoreState(endState);
       }
     
    +  //nocommit move method to before incrementToken
       @Override
       public void reset() throws IOException {
    -    super.reset();
    -    if (hasAttribute(CharTermAttribute.class)) {
    -      // we only create this if we really need it to safe the UTF-8 to UTF-16 conversion
    -      charTermAttribute = getAttribute(CharTermAttribute.class);
    --- End diff --
    
    I understand what you are asking me to do (put it back) but I don't understand _why_.  Which comment above are you referring to?  The one your feedback is on?  That comment doesn't reference this; or at least, if there is a relationship, it's very non-obvious to me (and we should add comments explaining it).


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191578308
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    +
    +  /*
    +   * Token stream which converts a provided token stream to an automaton.
    +   * The accepted strings enumeration from the automaton are available through the
    +   * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    +   * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    +   * a completion's payload (see {@link ConcatenateGraphFilter#setPayload(org.apache.lucene.util.BytesRef)})
    +   */
    +
    +  /**
    +   * Represents the separation between tokens, if
    +   * <code>preserveSep</code> is <code>true</code>.
    +   */
    +  public final static char SEP_CHAR = '\u001F';
    --- End diff --
    
    One user is the SolrTextTagger, which needs to know the character separator between concatenated terms: https://github.com/OpenSextant/SolrTextTagger/blob/master/src/main/java/org/opensextant/solrtexttagger/TermPrefixCursor.java#L45 (being added to Solr in SOLR-12376).  Hmmm, though it is a byte there, which is to your point.  Maybe I should change it to a byte and add a little comment about this?
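    
    For illustration, a minimal sketch (not the tagger's actual code) of how such an external consumer might split a concatenated term back apart on the separator; the input string here is made up:
    ```
    // hypothetical consumer: split a concatenated output token back into its parts
    String concatenated = "new" + ConcatenateGraphFilter.SEP_CHAR + "york" + ConcatenateGraphFilter.SEP_CHAR + "city";
    String[] parts = concatenated.split(Character.toString(ConcatenateGraphFilter.SEP_CHAR));
    // parts -> ["new", "york", "city"]
    ```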


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191548433
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    +
    +  /*
    +   * Token stream which converts a provided token stream to an automaton.
    +   * The accepted strings enumeration from the automaton are available through the
    +   * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    +   * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    +   * a completion's payload (see {@link ConcatenateGraphFilter#setPayload(org.apache.lucene.util.BytesRef)})
    +   */
    +
    +  /**
    +   * Represents the separation between tokens, if
    +   * <code>preserveSep</code> is <code>true</code>.
    +   */
    +  public final static char SEP_CHAR = '\u001F';
    --- End diff --
    
    I think it's better to keep SEP_LABEL as an int. It needs to be equal to TokenStreamToAutomaton.POS_SEP anyway, so maybe you can use that value directly? PAYLOAD_SEP has the same requirement, so we should also set it to POS_SEP.
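    
    A sketch of what that could look like (the SEP_LABEL name and placement follow the suggestion above; POS_SEP is the existing TokenStreamToAutomaton constant):
    ```
    // tie both separators to TokenStreamToAutomaton.POS_SEP instead of repeating the literal
    public static final int SEP_LABEL = TokenStreamToAutomaton.POS_SEP;    // 0x001F today
    public static final int PAYLOAD_SEP = TokenStreamToAutomaton.POS_SEP;
    ```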


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191572617
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -119,27 +147,19 @@ public boolean incrementToken() throws IOException {
     
       @Override
       public void end() throws IOException {
    -    super.end();
    -    if (finiteStrings == null) {
    -      inputTokenStream.end();
    -    }
    -  }
    -
    -  @Override
    -  public void close() throws IOException {
    -    if (finiteStrings == null) {
    -      inputTokenStream.close();
    -    }
    +    restoreState(endState);
       }
     
    +  //nocommit move method to before incrementToken
       @Override
       public void reset() throws IOException {
    -    super.reset();
    -    if (hasAttribute(CharTermAttribute.class)) {
    -      // we only create this if we really need it to safe the UTF-8 to UTF-16 conversion
    -      charTermAttribute = getAttribute(CharTermAttribute.class);
    --- End diff --
    
    I am referring to this comment: 
    ````
    //nocommit move method to before incrementToken
    ````
    but maybe I misunderstood what you meant by "move". Let me ask this differently: why do you need to build the automaton in reset()? Unless there is a good reason to do so, we shouldn't change the behavior here, which is to build the automaton on the first call to incrementToken().
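    
    A minimal sketch of that lazy pattern (class and helper names here are made up, not the actual patch):
    ```
    import java.io.IOException;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    
    // consume the input and build the heavy state on the first incrementToken(), not in reset()
    public final class LazyConsumingFilter extends TokenFilter {
      private boolean consumed = false;
    
      LazyConsumingFilter(TokenStream input) {
        super(input);
      }
    
      @Override
      public boolean incrementToken() throws IOException {
        if (!consumed) {
          consumeInputAndBuild();   // e.g. toAutomaton(); the whole input is read here
          consumed = true;
        }
        return emitNextOutputToken();
      }
    
      @Override
      public void reset() throws IOException {
        super.reset();
        consumed = false;           // reset() stays cheap; nothing is consumed or built here
      }
    
      private void consumeInputAndBuild() throws IOException { /* placeholder */ }
    
      private boolean emitNextOutputToken() { return false; /* placeholder */ }
    }
    ```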


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191574085
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -119,27 +147,19 @@ public boolean incrementToken() throws IOException {
     
       @Override
       public void end() throws IOException {
    -    super.end();
    -    if (finiteStrings == null) {
    -      inputTokenStream.end();
    -    }
    -  }
    -
    -  @Override
    -  public void close() throws IOException {
    -    if (finiteStrings == null) {
    -      inputTokenStream.close();
    -    }
    +    restoreState(endState);
       }
     
    +  //nocommit move method to before incrementToken
       @Override
       public void reset() throws IOException {
    -    super.reset();
    -    if (hasAttribute(CharTermAttribute.class)) {
    -      // we only create this if we really need it to safe the UTF-8 to UTF-16 conversion
    -      charTermAttribute = getAttribute(CharTermAttribute.class);
    --- End diff --
    
    Oh *that* comment.  By that I mean move the _method_ to follow the natural call sequence.  I like classes to have methods that are organized sensibly.  When one uses a TokenStream, reset() should come before incrementToken(), then end(), then close(), and any utility methods should come after their callers.
    
    I moved the initialization to reset() simply because it felt more natural there than in incrementToken(), where it's lazy.  That's all.  That said... I hinted at a TestRandomChains failure about the lifecycle needing to throw an IllegalStateException or something like that (pertaining to a missing close()), and I think making this lazy again will "fix" that.


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley closed the pull request at:

    https://github.com/apache/lucene-solr/pull/384


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191594957
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    --- End diff --
    
    Converting this token stream into a token filter is trappy. It requires rewriting the logic completely. For instance, you removed the `close` implementation, which means close is now called twice on the input stream, since it is already closed here: https://github.com/apache/lucene-solr/pull/384/files#diff-04d4c6352889dbffd0eb4d1e6ecd6097R197. You also added a call to super(input), which was clearly avoided in the original impl:
    ````
    // Don't call the super(input) ctor - this is a true delegate and has a new attribute source since we consume
        // the input stream entirely in the first call to incrementToken
    ````
    My idea was to move this token stream only if minor changes are required. If it needs a complete rewrite then we should probably reconsider.
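    
    For context, a minimal sketch of the pattern that comment describes (names are assumptions, not the real class): the original implementation holds the input as a plain field instead of passing it to super(), so it keeps its own attribute source and owns the delegate's lifecycle explicitly:
    ```
    import java.io.IOException;
    import org.apache.lucene.analysis.TokenStream;
    
    public final class ConsumingStream extends TokenStream {
      // a true delegate: not passed to super(), so this stream has its own attribute source
      private final TokenStream inputTokenStream;
    
      ConsumingStream(TokenStream inputTokenStream) {
        this.inputTokenStream = inputTokenStream;
      }
    
      @Override
      public boolean incrementToken() throws IOException {
        // the delegate is consumed entirely on the first call; placeholder here
        return false;
      }
    
      @Override
      public void close() throws IOException {
        super.close();
        inputTokenStream.close();   // must be closed explicitly since it is not TokenFilter's input
      }
    }
    ```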


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191546661
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    --- End diff --
    
    I still don't understand why you need to use a TokenFilter instead of a TokenStream. You need to override all the methods anyway, so I don't see the benefit.


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191549169
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -119,27 +147,19 @@ public boolean incrementToken() throws IOException {
     
       @Override
       public void end() throws IOException {
    -    super.end();
    -    if (finiteStrings == null) {
    -      inputTokenStream.end();
    -    }
    -  }
    -
    -  @Override
    -  public void close() throws IOException {
    -    if (finiteStrings == null) {
    -      inputTokenStream.close();
    -    }
    +    restoreState(endState);
       }
     
    +  //nocommit move method to before incrementToken
       @Override
       public void reset() throws IOException {
    -    super.reset();
    -    if (hasAttribute(CharTermAttribute.class)) {
    -      // we only create this if we really need it to safe the UTF-8 to UTF-16 conversion
    -      charTermAttribute = getAttribute(CharTermAttribute.class);
    --- End diff --
    
    The automaton should be created lazily on the first call to incrementToken, as described in the comment above ;).


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191569307
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    +
    +  /*
    +   * Token stream which converts a provided token stream to an automaton.
    +   * The accepted strings enumeration from the automaton are available through the
    +   * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    +   * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    +   * a completion's payload (see {@link ConcatenateGraphFilter#setPayload(org.apache.lucene.util.BytesRef)})
    +   */
    +
    +  /**
    +   * Represents the separation between tokens, if
    +   * <code>preserveSep</code> is <code>true</code>.
    +   */
    +  public final static char SEP_CHAR = '\u001F';
    --- End diff --
    
    I think this is the difference between an implementation detail and something more meaningful/intuitive to the user.  I'm trying to cater to the user here.  Maybe we could have both?


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191568572
  
    --- Diff: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggesterBuilder.java ---
    @@ -43,6 +44,9 @@
        * in the output
        */
       public static final int PAYLOAD_SEP = '\u001F';
    +  static {
    +    assert PAYLOAD_SEP == ConcatenateGraphFilter.SEP_CHAR;
    --- End diff --
    
    I think it's useful to be able to read the code and see both what its value is and that it's equal to something else, without having to navigate or use IDE functions.  But this is no big deal to me; I can put it back.  Presumably if these values were not the same then some test would fail, but I'm not sure.


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191549617
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilterFactory.java ---
    @@ -0,0 +1,61 @@
    +/*
    + This software was produced for the U. S. Government
    + under Contract No. W15P7T-11-C-F600, and is
    + subject to the Rights in Noncommercial Computer Software
    + and Noncommercial Computer Software Documentation
    + Clause 252.227-7014 (JUN 1995)
    +
    --- End diff --
    
    Is this really needed? I understand that you have some constraints on the development of the main feature, but this file doesn't seem related, so maybe use the usual license?


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191573460
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    +
    +  /*
    +   * Token stream which converts a provided token stream to an automaton.
    +   * The accepted strings enumeration from the automaton are available through the
    +   * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    +   * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    +   * a completion's payload (see {@link ConcatenateGraphFilter#setPayload(org.apache.lucene.util.BytesRef)})
    +   */
    +
    +  /**
    +   * Represents the separation between tokens, if
    +   * <code>preserveSep</code> is <code>true</code>.
    +   */
    +  public final static char SEP_CHAR = '\u001F';
    --- End diff --
    
    This value should only be used internally, to encode a separator in the automaton or in the binary terms that we index. Which user are you referring to?


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191566890
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilterFactory.java ---
    @@ -0,0 +1,61 @@
    +/*
    + This software was produced for the U. S. Government
    + under Contract No. W15P7T-11-C-F600, and is
    + subject to the Rights in Noncommercial Computer Software
    + and Noncommercial Computer Software Documentation
    + Clause 252.227-7014 (JUN 1995)
    +
    --- End diff --
    
    Good catch; no it's not needed.


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191551707
  
    --- Diff: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggesterBuilder.java ---
    @@ -43,6 +44,9 @@
        * in the output
        */
       public static final int PAYLOAD_SEP = '\u001F';
    +  static {
    +    assert PAYLOAD_SEP == ConcatenateGraphFilter.SEP_CHAR;
    --- End diff --
    
    Can this be ensured in a dedicated test?


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191633254
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    --- End diff --
    
    I do tend to increase the scope a bit once I get my hands on things, but my intention is a better result ("better" being in the eye of the beholder, of course).  I locally caught that close() issue as well but forgot to mention it.  How about this: unless anyone says otherwise, I'll change it back to a TokenStream... and change consumption to occur in incrementToken.  But keep the name; the parent class is an implementation detail.  ConcatenateGraphFilterFactory would produce a ConcatenateGraphFilter that happens to subclass TokenStream directly.  WDYT?
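    
    A sketch of that arrangement (the constructor arguments and factory body are assumptions, not the committed code):
    ```
    import java.util.Map;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.util.TokenFilterFactory;
    
    // the factory contract only requires returning a TokenStream, so the filter can
    // subclass TokenStream directly while keeping the *Filter name
    public class ConcatenateGraphFilterFactory extends TokenFilterFactory {
      public ConcatenateGraphFilterFactory(Map<String, String> args) {
        super(args);
      }
    
      @Override
      public TokenStream create(TokenStream input) {
        return new ConcatenateGraphFilter(input);   // assuming a (TokenStream) constructor
      }
    }
    ```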


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191551615
  
    --- Diff: lucene/suggest/src/java/org/apache/lucene/search/suggest/document/NRTSuggesterBuilder.java ---
    @@ -43,6 +44,9 @@
        * in the output
        */
       public static final int PAYLOAD_SEP = '\u001F';
    --- End diff --
    
    You can set it to TokenStreamToAutomaton.POS_SEP directly and add a test to ensure that PAYLOAD_SEP and SEP_LABEL are equal to `0x001F` for BWC.
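    
    A sketch of such a test (class and method names are made up, and it assumes a location from which NRTSuggesterBuilder and ConcatenateGraphFilter are visible):
    ```
    import org.apache.lucene.analysis.TokenStreamToAutomaton;
    import org.apache.lucene.analysis.miscellaneous.ConcatenateGraphFilter;
    import org.apache.lucene.search.suggest.document.NRTSuggesterBuilder;  // assuming it is accessible to the test
    import org.apache.lucene.util.LuceneTestCase;
    
    public class TestSeparatorBackCompat extends LuceneTestCase {
      public void testSeparatorsUnchanged() {
        // the index format depends on these values staying 0x001F
        assertEquals(0x001F, TokenStreamToAutomaton.POS_SEP);
        assertEquals(0x001F, NRTSuggesterBuilder.PAYLOAD_SEP);
        assertEquals(0x001F, ConcatenateGraphFilter.SEP_CHAR);  // or SEP_LABEL, if renamed as suggested
      }
    }
    ```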


---



[GitHub] lucene-solr pull request #384: LUCENE-8332 move CompletionTokenStream to Con...

Posted by jimczi <gi...@git.apache.org>.
Github user jimczi commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/384#discussion_r191673559
  
    --- Diff: lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ConcatenateGraphFilter.java ---
    @@ -31,80 +33,106 @@
     import org.apache.lucene.util.IOUtils;
     import org.apache.lucene.util.IntsRef;
     import org.apache.lucene.util.automaton.Automaton;
    -import org.apache.lucene.util.automaton.FiniteStringsIterator;
     import org.apache.lucene.util.automaton.LimitedFiniteStringsIterator;
     import org.apache.lucene.util.automaton.Operations;
    +import org.apache.lucene.util.automaton.TooComplexToDeterminizeException;
     import org.apache.lucene.util.automaton.Transition;
     import org.apache.lucene.util.fst.Util;
     
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_MAX_GRAPH_EXPANSIONS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_POSITION_INCREMENTS;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.DEFAULT_PRESERVE_SEP;
    -import static org.apache.lucene.search.suggest.document.CompletionAnalyzer.SEP_LABEL;
    -
     /**
    - * Token stream which converts a provided token stream to an automaton.
    - * The accepted strings enumeration from the automaton are available through the
    - * {@link org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute} attribute
    - * The token stream uses a {@link org.apache.lucene.analysis.tokenattributes.PayloadAttribute} to store
    - * a completion's payload (see {@link CompletionTokenStream#setPayload(org.apache.lucene.util.BytesRef)})
    + * Concatenates/Joins every incoming token with a separator into one output token for every path through the
    + * token stream (which is a graph).  In simple cases this yields one token, but in the presence of any tokens with
    + * a zero positionIncrmeent (e.g. synonyms) it will be more.  This filter uses the token bytes, position increment,
    + * and position length of the incoming stream.  Other attributes are not used or manipulated.
      *
      * @lucene.experimental
      */
    -public final class CompletionTokenStream extends TokenStream {
    +public final class ConcatenateGraphFilter extends TokenFilter {
    --- End diff --
    
    +1


---



[GitHub] lucene-solr issue #384: LUCENE-8332 move CompletionTokenStream to Concatenat...

Posted by dsmiley <gi...@git.apache.org>.
Github user dsmiley commented on the issue:

    https://github.com/apache/lucene-solr/pull/384
  
    There's a test failure in TestFactories.test which calls BaseTokenStreamTestCase.checkResetException.
    
    `ant test -Dtestcase=TestFactories -Dtests.seed=543C55AE2375085C`
    ```
       [junit4]    > Throwable #1: java.lang.AssertionError: didn't get expected exception when close() not called
       [junit4]    >        at __randomizedtesting.SeedInfo.seed([543C55AE2375085C:DC686A748D8965A4]:0)
       [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:453)
       [junit4]    >        at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
       [junit4]    >        at org.apache.lucene.analysis.core.TestFactories.doTestTokenFilter(TestFactories.java:118)
       [junit4]    >        at org.apache.lucene.analysis.core.TestFactories.test(TestFactories.java:68)
       [junit4]    >        at java.lang.Thread.run(Thread.java:748)
    ```
    The issue is that if some faulty code forgets to call close(), ideally an exception would be thrown, but for this filter it won't be, because the inputTokenStream is already closed internally by toAutomaton.  Checks like this are kinda... I dunno, I don't like them.  IMO it's asking too much of a TokenFilter to insist that it be cranky when it's mistreated; maybe some components (like this one) are simply more resilient?  Maybe BaseTokenStreamTestCase (or somewhere else) should keep track of components that are resilient in this way so that it doesn't exercise these sorts of checks?  If we really want to insist that every component be cranky when mistreated, then perhaps we could have toAutomaton _not_ close the inputTokenStream so that we can delegate that from CGF's close?
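    
    A sketch of that last option, with assumed names and details elided (not the committed change): close() takes over closing the delegate, and toAutomaton() stops doing it itself:
    ```
    // fragment with assumed names, inside the filter
    @Override
    public void close() throws IOException {
      super.close();
      inputTokenStream.close();    // delegate close here instead of inside toAutomaton()
    }
    
    public Automaton toAutomaton() throws IOException {
      Automaton automaton = null;  // built from the consumed tokens (details elided)
      inputTokenStream.reset();
      try {
        // ... consume inputTokenStream and build the automaton ...
        return automaton;
      } finally {
        inputTokenStream.end();
        // no inputTokenStream.close() here any more, so a caller that forgets to
        // call close() on this stream is detectable again by checkResetException
      }
    }
    ```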



---
