Posted to java-user@lucene.apache.org by "Igal @ getRailo.org" <ig...@getrailo.org> on 2012/11/02 00:31:43 UTC
tokenizer's tokens
I'm trying to write a very simple method to show the different tokens
that come out of a tokenizer. When I call WhitespaceTokenizer's (or
LetterTokenizer's) incrementToken() method, though, I get an
ArrayIndexOutOfBoundsException (see below).
Any ideas?
P.S. If I use StandardTokenizer, it works.
java.lang.ArrayIndexOutOfBoundsException: -1
    at java.lang.Character.codePointAtImpl(Character.java:4739)
    at java.lang.Character.codePointAt(Character.java:4702)
    at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
    at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
    at test.Test1.tokenize(Test1.java:46)
    at test.Test1.main(Test1.java:139)
class Test1 {

    static Version v = Version.LUCENE_40;

    static void tokenize( String s ) throws IOException {
        Reader r = new StringReader( s );
        Tokenizer t = new WhitespaceTokenizer( v, r );
        CharTermAttribute attrTerm = t.getAttribute( CharTermAttribute.class );
        while ( t.incrementToken() ) {
            String term = attrTerm.toString();
            System.out.println( term );
        }
    }

    public static void main( String[] args ) throws IOException {
        String[] text = {
            "The quick brown fox jumps over the lazy dog",
            "Only the fool would take trouble to verify that his "
            + "sentence was composed of ten a's, three b's, four c's, four d's, "
            + "forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two "
            + "k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, "
            + "sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight "
            + "w's, four x's, eleven y's, twenty-seven commas, twenty-three "
            + "apostrophes, seven hyphens and, last but not least, a single!",
        };
        for ( String s : text )
            tokenize( s );
    }
}
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: tokenizer's tokens
Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
Thank you :)
On 11/1/2012 4:45 PM, Robert Muir wrote:
> this is intentional (since you have a bug in your code).
>
> you need to call reset(): see the tokenstream contract, step 2:
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
Re: tokenizer's tokens
Posted by Robert Muir <rc...@gmail.com>.
This exception is intentional (since you have a bug in your code).
You need to call reset() before the first incrementToken(); see the TokenStream contract, step 2:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
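For readers following along, here is a minimal sketch of the posted tokenize() method with the missing reset() call added, assuming Lucene 4.0 with the core and analyzers-common jars on the classpath. The class name TokenizeWithReset and the constant MATCH_VERSION are made up for illustration. It also uses addAttribute() rather than getAttribute(), calls end() and close() after the loop, and returns the terms as a list instead of printing them inside the loop; those are choices of this sketch and were not part of the reply above.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical class name; a sketch against the Lucene 4.0 API.
class TokenizeWithReset {

    static final Version MATCH_VERSION = Version.LUCENE_40;

    // Collects the terms produced by WhitespaceTokenizer for the given text.
    static List<String> tokenize(String s) throws IOException {
        List<String> terms = new ArrayList<String>();
        Reader r = new StringReader(s);
        Tokenizer t = new WhitespaceTokenizer(MATCH_VERSION, r);
        try {
            // Obtain the attribute once, before consuming the stream.
            CharTermAttribute attrTerm = t.addAttribute(CharTermAttribute.class);
            // The call that was missing from the original code.
            t.reset();
            while (t.incrementToken()) {
                terms.add(attrTerm.toString());
            }
            // Signal end-of-stream after the last token.
            t.end();
        } finally {
            // Release the underlying Reader even if tokenizing fails.
            t.close();
        }
        return terms;
    }

    public static void main(String[] args) throws IOException {
        for (String term : tokenize("The quick brown fox jumps over the lazy dog")) {
            System.out.println(term);
        }
    }
}
```

The try/finally form is used instead of try-with-resources only so the sketch also compiles on Java 6, which Lucene 4.0 still supports.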