Posted to java-user@lucene.apache.org by Beady Geraghty <be...@gmail.com> on 2005/09/21 17:02:28 UTC

standardTokenizer - how to terminate at End of Stream

Could someone tell me how to use the StandardTokenizer properly?
 I thought that if tokenizer.getNextToken() returns null, then it is
the end of the stream. I have a loop that tries to get the next token until
it is null, but the loop doesn't terminate.
I tried to terminate the loop with t.kind == 0, and it seems to have stopped
at the end of the stream. I am not sure what t.kind really is.
The code mentioned that it is defined in Constants.java, and I looked
that up, but it is apparently not the right file. Maybe I am pointing to
a wrong directory.


StandardTokenizer tokenizer = new StandardTokenizer(r); // r is a Reader
int count = 0;
Token t = tokenizer.getNextToken();

while (t != null) {
    count++;
    //if (t.kind == 0)
    //    break;
    System.out.println(t);
    t = tokenizer.getNextToken();
    System.out.println(count);
}
System.out.println("done");
 Thank you for any input.
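
A loop that terminates cleanly can test the token kind against the generated
EOF constant instead of testing for null. A minimal sketch, assuming the
JavaCC-generated StandardTokenizer, Token, and StandardTokenizerConstants
classes in org.apache.lucene.analysis.standard (the class name and sample
text here are illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizerConstants;
import org.apache.lucene.analysis.standard.Token;

public class TokenizeToEof {
    public static void main(String[] args) {
        StandardTokenizer tokenizer =
            new StandardTokenizer(new StringReader("The quick brown fox"));
        int count = 0;
        Token t = tokenizer.getNextToken();
        // getNextToken() never returns null; the stream ends with a special
        // EOF token whose kind is StandardTokenizerConstants.EOF (i.e. 0),
        // which is why a null test loops forever.
        while (t.kind != StandardTokenizerConstants.EOF) {
            count++;
            System.out.println(t.image); // the raw matched text
            t = tokenizer.getNextToken();
        }
        System.out.println("done, " + count + " tokens");
    }
}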

Re: standardTokenizer - how to terminate at End of Stream

Posted by Beady Geraghty <be...@gmail.com>.
Thank you for your response.
That was my original goal.
 On 9/21/05, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> : Since I used the StandardAnalyzer when I originally created the index,
> : I therefore use the StandardTokenizer to tokenize the input stream.
> : Is there a better way to do what I am trying to do?
> : From your comment below, it appears that I should just use next() instead
>
> If your goal is to recreate the tokens you get from using
> StandardAnalyzer, then don't use StandardTokenizer -- use
> StandardAnalyzer. It does other things besides tokenizing. Get the
> TokenStream from StandardAnalyzer and its next() method should do what
> you want.
>
>
>
> None of which should imply that this is the best way to achieve your goal
> -- I'm sure the highlighter package will do what you want, but I've never
> used it personally.
>
>
> -Hoss
>
>

Re: standardTokenizer - how to terminate at End of Stream

Posted by Chris Hostetter <ho...@fucit.org>.
:  Since I used the StandardAnalyzer when I originally created the index,
: I therefore use the StandardTokenizer to tokenize the input stream.
:  Is there a better way to do what I am trying to do?
:   From your comment below, it appears that I should just use next() instead

If your goal is to recreate the tokens you get from using
StandardAnalyzer, then don't use StandardTokenizer -- use
StandardAnalyzer.  It does other things besides tokenizing.  Get the
TokenStream from StandardAnalyzer and its next() method should do what
you want.
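
For instance, a minimal sketch of that loop, assuming the Lucene 1.x
TokenStream API in which next() returns null at end of stream (the field
name "contents" and the sample text are placeholders):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerTokens {
    public static void main(String[] args) throws Exception {
        // tokenStream() chains StandardTokenizer with the analyzer's
        // filters (StandardFilter, LowerCaseFilter, StopFilter), so the
        // tokens come out the same way they went into the index.
        TokenStream stream = new StandardAnalyzer()
            .tokenStream("contents", new StringReader("The Quick Brown Fox"));
        // next() returns null at end of stream, so this loop terminates.
        for (Token t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t.termText());
        }
    }
}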



None of which should imply that this is the best way to achieve your goal
-- I'm sure the highlighter package will do what you want, but I've never
used it personally.


-Hoss




Re: standardTokenizer - how to terminate at End of Stream

Posted by Beady Geraghty <be...@gmail.com>.
Thank you for the response.
 I was trying to do something really simple - I want to extract the context
for terms and phrases from files that satisfy some (many) queries.
I *know* that the file test.txt is a hit (because I queried the index, and
it tells me that test.txt satisfies the query). Then I open the file and
use Lucene's StandardTokenizer to tokenize the input. I get a token at a
time to see which token or consecutive tokens match the terms/phrases.
Then I extract the context surrounding these terms.
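
The start/end offsets carried by Lucene's analysis tokens are enough for
that kind of extraction. A minimal sketch under that assumption (the class
and method names, the field name, and the window size are illustrative,
not from this thread):

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class ContextExtractor {
    // Print a character window around each occurrence of 'term' in 'text'.
    // Note: StandardAnalyzer lowercases, so 'term' must be lowercase too.
    static void printContexts(String text, String term, int window)
            throws Exception {
        TokenStream stream = new StandardAnalyzer()
            .tokenStream("contents", new StringReader(text));
        for (Token t = stream.next(); t != null; t = stream.next()) {
            if (t.termText().equals(term)) {
                // startOffset()/endOffset() point back into the original
                // text, so the surrounding context can be sliced out.
                int from = Math.max(0, t.startOffset() - window);
                int to = Math.min(text.length(), t.endOffset() + window);
                System.out.println("..." + text.substring(from, to) + "...");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        printContexts("The quick brown fox jumps over the lazy dog", "fox", 12);
    }
}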
 I didn't try the highlighter because I don't really need to "highlight",
and I didn't look closely at whether some of the classes provided in the
package would already do what I need. (Although I would imagine many people
have already done what I am trying to do. It appears to have a fragmenter,
and I don't know if that is something I need.)
 Since I used the StandardAnalyzer when I originally created the index,
I therefore use the StandardTokenizer to tokenize the input stream.
 Is there a better way to do what I am trying to do?
  From your comment below, it appears that I should just use next() instead
of getNextToken(), is that correct?
 Thanks


 On 9/21/05, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
> Could you elaborate on what you're trying to do, please?
>
> Using StandardTokenizer in this low-level fashion is practically
> unheard of, so I think knowing what you're attempting to do will help
> us help you :)
>
> Erik
>
>
>

Re: standardTokenizer - how to terminate at End of Stream

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
Could you elaborate on what you're trying to do, please?

Using StandardTokenizer in this low-level fashion is practically  
unheard of, so I think knowing what you're attempting to do will help  
us help you :)

     Erik


On Sep 21, 2005, at 12:17 PM, Beady Geraghty wrote:

> I see some definitions in StandardTokenizerConstants.java
> Perhaps these are the values for t.kind.
>  Perhaps I was confused about the usage of
> getNextToken() and next() in the standard analyzer.
> When should one use getNextToken() instead of next()?
>  I am just starting to use Lucene, so please excuse these
> simple questions.
>  Thanks
>
>
>  On 9/21/05, Beady Geraghty <be...@gmail.com> wrote:
>
>>
>> Could someone tell me how to use the StandardTokenizer properly?
>> I thought that if tokenizer.getNextToken() returns null, then it is
>> the end of the stream. I have a loop that tries to get the next token
>> until it is null, but the loop doesn't terminate.
>> I tried to terminate the loop with t.kind == 0, and it seems to have
>> stopped at the end of the stream. I am not sure what t.kind really is.
>> The code mentioned that it is defined in Constants.java, and I looked
>> that up, but it is apparently not the right file. Maybe I am pointing
>> to a wrong directory.
>>
>>
>> StandardTokenizer tokenizer = new StandardTokenizer(r); // r is a Reader
>> int count = 0;
>> Token t = tokenizer.getNextToken();
>>
>> while (t != null) {
>>     count++;
>>     //if (t.kind == 0)
>>     //    break;
>>     System.out.println(t);
>>     t = tokenizer.getNextToken();
>>     System.out.println(count);
>> }
>> System.out.println("done");
>> Thank you for any input.
>>
>




Re: standardTokenizer - how to terminate at End of Stream

Posted by Beady Geraghty <be...@gmail.com>.
I see some definitions in StandardTokenizerConstants.java
Perhaps these are the values for t.kind.
 Perhaps I was confused about the usage of
getNextToken() and next() in the standard analyzer.
When should one use getNextToken() instead of next()?
 I am just starting to use Lucene, so please excuse these
simple questions.
 Thanks
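
For what it's worth, the two methods differ mainly in how they signal end
of stream; a minimal sketch contrasting the two loops, assuming the
JavaCC-generated classes in org.apache.lucene.analysis.standard (the
sample text is illustrative):

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizerConstants;

public class NextVsGetNextToken {
    public static void main(String[] args) throws Exception {
        // getNextToken() is the JavaCC-generated method: it never returns
        // null and marks the end with an EOF token (kind == 0).
        StandardTokenizer a = new StandardTokenizer(new StringReader("one two"));
        for (org.apache.lucene.analysis.standard.Token t = a.getNextToken();
             t.kind != StandardTokenizerConstants.EOF;
             t = a.getNextToken()) {
            System.out.println("getNextToken(): " + t.image);
        }

        // next() is the Lucene TokenStream method that analyzers call:
        // it returns null at end of stream.
        StandardTokenizer b = new StandardTokenizer(new StringReader("one two"));
        for (org.apache.lucene.analysis.Token t = b.next(); t != null;
             t = b.next()) {
            System.out.println("next(): " + t.termText());
        }
    }
}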


 On 9/21/05, Beady Geraghty <be...@gmail.com> wrote:
>
> Could someone tell me how to use the StandardTokenizer properly?
>  I thought that if tokenizer.getNextToken() returns null, then it is
> the end of the stream. I have a loop that tries to get the next token until
> it is null, but the loop doesn't terminate.
> I tried to terminate the loop with t.kind == 0, and it seems to have stopped
> at the end of the stream. I am not sure what t.kind really is.
> The code mentioned that it is defined in Constants.java, and I looked
> that up, but it is apparently not the right file. Maybe I am pointing to
> a wrong directory.
>
>
> StandardTokenizer tokenizer = new StandardTokenizer(r); // r is a Reader
> int count = 0;
> Token t = tokenizer.getNextToken();
>
> while (t != null) {
>     count++;
>     //if (t.kind == 0)
>     //    break;
>     System.out.println(t);
>     t = tokenizer.getNextToken();
>     System.out.println(count);
> }
> System.out.println("done");
>  Thank you for any input.
>