Posted to java-user@lucene.apache.org by Maksym Krasovskiy <ma...@ciklum.com> on 2013/01/15 12:28:31 UTC
Lucene 4.0 WhitespaceAnalyzer problem
Hi!
I am trying to use WhitespaceAnalyzer from Lucene 4.0 to split strings into words.
I wrote a small test:
@Test
public void whitespaceAnalyzerTest() throws IOException {
    String string = "sdfdsf sdfsdf sd sdf ";
    Analyzer wa = new WhitespaceAnalyzer(Version.LUCENE_40);
    TokenStream tokenStream = wa.tokenStream("", new StringReader(string));
    while (tokenStream.incrementToken()) {
        System.out.println(tokenStream.getAttribute(CharTermAttribute.class).toString());
    }
}
but I got an exception:
java.lang.ArrayIndexOutOfBoundsException: -1
at java.lang.Character.codePointAtImpl(Character.java:2405)
at java.lang.Character.codePointAt(Character.java:2369)
at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
at com.maxx.tests.lucene40test.analyzer.AnalyzerTest.whitespaceAnalyzerTest(AnalyzerTest.java:93)
...
If I change WhitespaceAnalyzer to StandardAnalyzer, it works correctly.
As a workaround I can create a StandardAnalyzer without stopwords, but why doesn't my code work?
--
Krasovskiy Maxim
Re: Lucene 4.0 WhitespaceAnalyzer problem
Posted by Alon Muchnick <al...@datonics.com>.
Hi Maxim,
You need to reset the TokenStream before the while loop by calling tokenStream.reset().
See
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html
and look under "Invoking the Analyzer":
"ts.reset(); // Resets this stream to the beginning. (Required)"
Alon
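
For reference, here is how the original test might look with the consume sequence from that documentation page applied: reset(), the incrementToken() loop, end(), then close(). This is a minimal sketch, assuming Lucene 4.0 core and analyzers-common are on the classpath. The class name WhitespaceTokens and the tokenize helper are made up for illustration and are not part of Lucene.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical helper class, used only to illustrate the call order.
public class WhitespaceTokens {

    // Splits text on whitespace and returns the tokens in order.
    static List<String> tokenize(String text) throws IOException {
        Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_40);
        List<String> tokens = new ArrayList<String>();
        TokenStream ts = analyzer.tokenStream("", new StringReader(text));
        try {
            // Fetch the attribute once; its value changes on each incrementToken().
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // required before the first incrementToken()
            while (ts.incrementToken()) {
                tokens.add(term.toString());
            }
            ts.end(); // signals end of stream
        } finally {
            ts.close(); // releases the underlying reader
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (String token : tokenize("sdfdsf sdfsdf sd sdf ")) {
            System.out.println(token);
        }
    }
}
```

The attribute is looked up once outside the loop rather than on every iteration, which is the pattern the package documentation uses.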