Posted to java-user@lucene.apache.org by "Igal @ getRailo.org" <ig...@getrailo.org> on 2012/11/02 00:31:43 UTC

tokenizer's tokens

I'm trying to write a very simple method to show the different tokens
that come out of a tokenizer.  When I call WhitespaceTokenizer's (or
LetterTokenizer's) incrementToken() method, though, I get an
ArrayIndexOutOfBoundsException (see below).

Any ideas?

P.S.  If I use StandardTokenizer, it works.


java.lang.ArrayIndexOutOfBoundsException: -1
     at java.lang.Character.codePointAtImpl(Character.java:4739)
     at java.lang.Character.codePointAt(Character.java:4702)
     at org.apache.lucene.analysis.util.CharacterUtils$Java5CharacterUtils.codePointAt(CharacterUtils.java:164)
     at org.apache.lucene.analysis.util.CharTokenizer.incrementToken(CharTokenizer.java:166)
     at test.Test1.tokenize(Test1.java:46)
     at test.Test1.main(Test1.java:139)


class Test1 {

     static Version v = Version.LUCENE_40;


     static void tokenize( String s ) throws IOException {

         Reader r = new StringReader( s );

         Tokenizer t = new WhitespaceTokenizer( v, r );

         CharTermAttribute attrTerm = t.getAttribute( CharTermAttribute.class );

         while ( t.incrementToken() ) {

             String term = attrTerm.toString();

             System.out.println( term );
         }
     }


     public static void main( String[] args ) throws IOException {

         String[] text = {

             "The quick brown fox jumps over the lazy dog",
             "Only the fool would take trouble to verify that his "
                 + "sentence was composed of ten a's, three b's, four c's, four d's, "
                 + "forty-six e's, sixteen f's, four g's, thirteen h's, fifteen i's, two "
                 + "k's, nine l's, four m's, twenty-five n's, twenty-four o's, five p's, "
                 + "sixteen r's, forty-one s's, thirty-seven t's, ten u's, eight v's, eight "
                 + "w's, four x's, eleven y's, twenty-seven commas, twenty-three "
                 + "apostrophes, seven hyphens and, last but not least, a single!",

         };

         for ( String s : text )
             tokenize( s );

     }

}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: tokenizer's tokens

Posted by "Igal @ getRailo.org" <ig...@getrailo.org>.
thank you :)


On 11/1/2012 4:45 PM, Robert Muir wrote:
> This is intentional (since you have a bug in your code).
>
> You need to call reset(): see the TokenStream contract, step 2:
> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
>




Re: tokenizer's tokens

Posted by Robert Muir <rc...@gmail.com>.
This is intentional (since you have a bug in your code).

You need to call reset(): see the TokenStream contract, step 2:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/TokenStream.html
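In the posted tokenize() method, that means adding a t.reset() call before the while loop; the same contract also asks for end() after the last token and close() when done. Below is a minimal, stdlib-only sketch of that calling order. SketchTokenizer is a hypothetical stand-in written for this example, not a Lucene class: it only borrows the method names reset(), incrementToken(), end() and close(), and it assumes a tokenizer that refuses to run until reset() has been called. The real WhitespaceTokenizer is consumed in the same order.

```java
import java.util.ArrayList;
import java.util.List;

public class ResetContractSketch {

    /** Hypothetical stand-in for a whitespace tokenizer; NOT a Lucene class. */
    static final class SketchTokenizer {
        private final String input;
        private int pos = -1;      // -1 marks "reset() has not been called yet"
        private String term = "";

        SketchTokenizer(String input) {
            this.input = input;
        }

        /** Prepares the stream for consumption; must precede incrementToken(). */
        void reset() {
            pos = 0;
        }

        /** Advances to the next whitespace-separated token, or returns false at the end. */
        boolean incrementToken() {
            if (pos < 0) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            // Skip any whitespace in front of the next token.
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            if (pos >= input.length()) {
                return false;
            }
            int start = pos;
            while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
                pos++;
            }
            term = input.substring(start, pos);
            return true;
        }

        String term() {
            return term;
        }

        /** End-of-stream hook; nothing to do in this sketch. */
        void end() {
        }

        /** Releases the stream; a new reset() is needed before reuse. */
        void close() {
            pos = -1;
        }
    }

    /** Consumes the tokenizer in contract order: reset, incrementToken loop, end, close. */
    static List<String> tokenize(String s) {
        SketchTokenizer t = new SketchTokenizer(s);
        List<String> terms = new ArrayList<>();
        try {
            t.reset();                       // the call missing from the posted code
            while (t.incrementToken()) {
                terms.add(t.term());
            }
            t.end();
        } finally {
            t.close();
        }
        return terms;
    }

    public static void main(String[] args) {
        for (String term : tokenize("The quick brown fox")) {
            System.out.println(term);
        }
    }
}
```

Calling incrementToken() on the stand-in without reset() fails immediately, which mirrors the situation in the original post, although there the failure showed up as an ArrayIndexOutOfBoundsException.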

