Posted to java-user@lucene.apache.org by Joe Wong <jw...@adacado.com> on 2014/03/20 20:57:46 UTC

Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Hi

We're planning to upgrade lucene-analyzers-common from 4.3.0 to 4.6.1. While
running our unit tests with 4.6.1, one of them fails at
org.apache.lucene.analysis.Tokenizer, line 88 (the setReader method). There
it checks if input != ILLEGAL_STATE_READER and then throws
IllegalStateException. Should it not be if input == ILLEGAL_STATE_READER?

Regards,
Joe

RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

I am glad that I was able to help you!

One more optimization for your consumer: CharTermAttribute implements CharSequence, so you can append it directly to the StringBuilder - there is no need to call toString() (see http://goo.gl/Ffg9tW):
	builder.append(termAttribute);
This saves a useless String instantiation for every token.
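
Applied to the incrementToken() loop from your stem() method, the loop would then read (same logic, just without the toString() call):

while (tokenStream.incrementToken()) {
    if (builder.length() > 0) {
        builder.append(' ');
    }
    builder.append(termAttribute); // CharTermAttribute is a CharSequence
}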

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Posted by Joe Wong <jw...@adacado.com>.
Thanks Uwe. It worked.

RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

The IllegalStateException tells you what's wrong: "TokenStream contract violation: close() call missing".

Analyzer internally reuses TokenStreams, so when you call Analyzer.tokenStream() a second time it returns the same instance of your TokenStream. On that second call the state machine detects that the stream was not closed before, and the cause is easy to see: your consumer never closes the TokenStream returned by the analyzer after the incrementToken() loop finishes. This is why I said you have to follow the official consuming workflow as described on the TokenStream API page. Use try...finally or Java 7 try-with-resources to make sure the TokenStream is closed after use. This also closes the reader (not strictly needed for a StringReader, but close() performs additional cleanup of other internal resources - the state machine ensures this is done).

One additional tip: in Lucene 4.6+ you no longer need to wrap a String in a StringReader to analyze it. Analyzer has a second tokenStream() method that takes the String to analyze directly (instead of a Reader) and uses an optimized workflow internally.
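
Putting both together, your stem() method could look like this - an untested sketch, but it keeps your names and follows the documented workflow (TokenStream implements Closeable, so try-with-resources closes it for you):

@Override
public String stem(String text) {
    StringBuilder builder = new StringBuilder();
    // close() is guaranteed by try-with-resources; it resets the
    // Analyzer's reuse state machine for the next call
    try (TokenStream tokenStream = analyzer.tokenStream(null, text)) {
        CharTermAttribute termAttribute =
                tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();                // before the first incrementToken()
        while (tokenStream.incrementToken()) {
            if (builder.length() > 0) {
                builder.append(' ');
            }
            builder.append(termAttribute);  // CharSequence, no toString() needed
        }
        tokenStream.end();                  // after the last incrementToken()
    } catch (IOException e) {
        throw new RuntimeException(e.getMessage(), e);
    }
    return builder.toString();
}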

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de




Re: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Posted by Joe Wong <jw...@adacado.com>.
Hi Uwe,

Thanks for the reply. I'm not very familiar with Lucene usage, so any help
would be appreciated.

Our test executes several consecutive stemming operations (the exception is
thrown on the second stemmer.stem() call). The code below does call the
reset() method but, as you say, it may be called in the wrong place.

@Test
public void stem() {
    LuceneStemmer stemmer = new LuceneStemmer();
    assertEquals("thing", stemmer.stem("thing"));
    assertEquals("thing", stemmer.stem("things"));
    assertEquals("genius", stemmer.stem("geniuses"));
    assertEquals("fri", stemmer.stem("fries"));
    assertEquals("gentli", stemmer.stem("gently"));
}

--- LuceneStemmer class ---
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LuceneStemmer implements Stemmer {

    /** Analyzer that tokenizes on whitespace, and lower-cases and stems words */
    private Analyzer analyzer = new StemmingAnalyzer();

    /**
     * Returns version of text with all words lower-cased and stemmed
     * @param text String to stem
     * @return stemmed text
     */
    @Override
    public String stem(String text) {
        StringBuilder builder = new StringBuilder();
        try {
            TokenStream tokenStream =
                    analyzer.tokenStream(null, new StringReader(text));
            tokenStream.reset();

            CharTermAttribute termAttribute =
                    tokenStream.getAttribute(CharTermAttribute.class);
            while (tokenStream.incrementToken()) {
                if (builder.length() > 0) {
                    builder.append(' ');
                }
                builder.append(termAttribute.toString());
            }
        } catch (IOException e) {
            // shouldn't happen reading from a StringReader, but you never know
            throw new RuntimeException(e.getMessage(), e);
        }
        return builder.toString();
    }

}

--- StemmingAnalyzer class ----

import com.google.common.collect.Sets;
import java.io.Reader;
import java.util.Collections;
import java.util.Set;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.util.CharArraySet;

import static com.adacado.Constants.*;

public final class StemmingAnalyzer extends Analyzer {

    private Set<String> stopWords;

    public StemmingAnalyzer() {
        this.stopWords = Collections.EMPTY_SET;
    }

    public StemmingAnalyzer(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    public StemmingAnalyzer(String... stopWords) {
        this.stopWords = Sets.newHashSet(stopWords);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(LUCENE_VERSION, reader);
        TokenStream filter = new StopFilter(LUCENE_VERSION,
                new PorterStemFilter(new LowerCaseFilter(LUCENE_VERSION, source)),
                CharArraySet.copy(LUCENE_VERSION, stopWords));
        return new TokenStreamComponents(source, filter);
    }

}

Stack trace:
java.lang.IllegalStateException: TokenStream contract violation: close() call missing
	at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:89)
	at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:307)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:145)
	at LuceneStemmer.stem(LuceneStemmer.java:28)
	at LuceneStemmerTest.stem(LuceneStemmerTest.java:16)

Thanks.

Regards,
Joe



RE: Possible issue with Tokenizer in lucene-analyzers-common-4.6.1

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi Joe,

In Lucene 4.6, the TokenStream/Tokenizer APIs got additional state machine checks to ensure that consumers and subclasses of those abstract classes are implemented correctly. The checks are not easy to understand, because they are written so that they don't affect performance. If your test case consumes the Tokenizer/TokenStream in the wrong way (e.g. failing to call reset() or setReader() at the correct places), an IllegalStateException is thrown. The ILLEGAL_STATE_READER exists to ensure that the consumer gets a meaningful exception if it calls setReader() or reset() in the wrong order (or multiple times).

The checks in the base class are definitely OK. If you hit the IllegalStateException, you have a problem in your implementation of the Tokenizer/TokenStream API (e.g. missing super() calls, or calling reset() from inside setReader()), or the consumer does not respect the full documented workflow: http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/analysis/TokenStream.html
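
The documented workflow boils down to this call order; a bare-bones consumer sketch:

TokenStream stream = analyzer.tokenStream("myfield", new StringReader("some text"));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
try {
    stream.reset();                    // 1. before consuming any token
    while (stream.incrementToken()) {  // 2. consume until exhausted
        System.out.println(termAtt.toString());
    }
    stream.end();                      // 3. perform end-of-stream operations
} finally {
    stream.close();                    // 4. release resources, reset the reuse state
}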

If you have TokenFilters in your analysis chain, the source of the error may also be missing super delegations in reset(), end(), etc. If you need further help, post the consumer implementation from your test case, or post your analysis chain and custom Tokenizers. Also include the stack trace, because it helps to work out what call sequence you have.
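
For illustration, a hypothetical TokenFilter (MyFilter is a made-up name) with the required super delegation in reset():

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public final class MyFilter extends TokenFilter {

    private int seen; // example of filter-local state

    protected MyFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            seen++; // filter-specific work would go here
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset(); // forgetting this delegation breaks the state machine
        seen = 0;      // then reset any filter-local state
    }
}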

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
