Posted to java-user@lucene.apache.org by Maksym Krasovskiy <ma...@ciklum.com> on 2013/01/15 12:28:31 UTC

Lucene 4.0 WhitespaceAnalyzer problem

Hi!
I am trying to use WhitespaceAnalyzer from Lucene 4.0 to split strings into words.
I wrote a small test:
@Test
public void whitespaceAnalyzerTest() throws IOException {
    String string = "sdfdsf sdfsdf sd sdf ";
    Analyzer wa = new WhitespaceAnalyzer(Version.LUCENE_40);
    TokenStream tokenStream = wa.tokenStream("", new StringReader(string));
    while (tokenStream.incrementToken()) {
        System.out.println(tokenStream.getAttribute(CharTermAttribute.class).toString());
    }
}

but got exception:
java.lang.ArrayIndexOutOfBoundsException: -1
    at java.lang.Character.codePointAtImpl(Character.java:2405)
    at java.lang.Character.codePointAt(Character.java:2369)
    at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
    at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
    at com.maxx.tests.lucene40test.analyzer.AnalyzerTest.whitespaceAnalyzerTest(AnalyzerTest.java:93)
    ...


If I change WhitespaceAnalyzer to StandardAnalyzer, it works correctly.
As a workaround I can create a StandardAnalyzer without stopwords, but why doesn't my code work?



--
Krasovskiy Maxim

Re: Lucene 4.0 WhitespaceAnalyzer problem

Posted by Alon Muchnick <al...@datonics.com>.
Hi Maxim,

You need to reset the TokenStream before the while loop: call tokenStream.reset() before the first call to incrementToken().

Check out
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html

and look under "Invoking the Analyzer":

"ts.reset(); // Resets this stream to the beginning. (Required)"


Alon

