Posted to java-user@lucene.apache.org by dokondr <do...@gmail.com> on 2012/12/25 19:17:34 UTC
TokenStream: How to get token text?
Hello,
Please help. I am lost in the TokenStream / Token / Analyzer API.
I am trying to figure out how to get the _token_itself_ (the token text) while
looking at the "Invoking the Analyzer" example (see the example below and also:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/analysis/package-summary.html?is-external=true#package_description
)
The method ts.reflectAsString(true) returns lots of useful info:
org.apache.lucene.analysis.tokenattributes.CharTermAttribute#term=some,org.apache.lucene.analysis.tokenattributes.TermToBytesRefAttribute#bytes=[73 6f 6d 65],org.apache.lucene.analysis.tokenattributes.OffsetAttribute#startOffset=0,org.apache.lucene.analysis.tokenattributes.OffsetAttribute#endOffset=4,org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute#positionIncrement=1,org.apache.lucene.analysis.tokenattributes.TypeAttribute#type=<ALPHANUM>,org.apache.lucene.analysis.tokenattributes.KeywordAttribute#keyword=false
But how do I get the token itself? In this case, "some".
Thanks!
------ Example in the documentation --------
Version matchVersion = Version.LUCENE_XY; // Substitute desired Lucene version for XY
Analyzer analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
// Attributes are registered on the stream itself
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
try {
  ts.reset(); // Resets this stream to the beginning. (Required)
  while (ts.incrementToken()) {
    // Use AttributeSource.reflectAsString(boolean)
    // for token stream debugging.
    System.out.println("token: " + ts.reflectAsString(true));
    System.out.println("token start offset: " + offsetAtt.startOffset());
    System.out.println("  token end offset: " + offsetAtt.endOffset());
  }
  ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
  ts.close(); // Release resources associated with this stream.
}
Re: TokenStream: How to get token text?
Posted by Steve Rowe <sa...@gmail.com>.
Hi Dima,
Did you see my response to your earlier email? I think it's what you're looking for:
http://markmail.org/message/jdcjxauj4odyuv7e
Steve
Re: TokenStream: How to get token text?
Posted by dokondr <do...@gmail.com>.
Hi Steve,
Thanks for your help (I just found your e-mail in the list archive); your
solution works!
Below is a complete working example. However, before finding your answer, I
hacked together a straw-man solution, which is a bad way to solve the problem:
// Hack out token - bad way!
String tmp = ts.reflectAsString(false);
String sameToken = (tmp.split(",")[0]).split("=")[1];
System.out.println("*** Same token : " + sameToken);
I repeat: this is not the right way, and I show it here just for fun.
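To make it concrete why the string-splitting hack is fragile, here is a minimal sketch in plain Java (no Lucene on the classpath). The input strings are hand-written samples modeled on the key=value format shown earlier in this thread, not real analyzer output, and the class name ReflectHackDemo is made up for illustration. The assumption is that a token's text may itself contain "," or "=", which can happen with analyzers that keep such characters inside a token.

```java
class ReflectHackDemo {
    // The hack from the message above: take the first "key=value" pair
    // of the reflected string and return what follows the first '='.
    static String hackToken(String reflected) {
        return reflected.split(",")[0].split("=")[1];
    }

    public static void main(String[] args) {
        // Hand-written samples in the style of ts.reflectAsString(false).
        String plain = "term=some,bytes=[73 6f 6d 65],startOffset=0,endOffset=4";
        String withComma = "term=1,5,bytes=[31 2c 35],startOffset=0,endOffset=3";
        String withEquals = "term=a=b,bytes=[61 3d 62],startOffset=0,endOffset=3";

        System.out.println(hackToken(plain));      // prints "some" (correct)
        System.out.println(hackToken(withComma));  // prints "1"    (token was "1,5")
        System.out.println(hackToken(withEquals)); // prints "a"    (token was "a=b")
    }
}
```

Reading the text from CharTermAttribute avoids the problem, because no parsing of a debug string is involved.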
---- Complete working example ----
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ru.RussianAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

Version matchVersion = Version.LUCENE_40; // or another Lucene version
Analyzer analyzer = new RussianAnalyzer(matchVersion); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
// To get token strings we need this:
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
try {
  ts.reset(); // Resets this stream to the beginning. (Required)
  while (ts.incrementToken()) {
    // Use AttributeSource.reflectAsString(boolean)
    // for token stream debugging.
    System.out.println("token: " + ts.reflectAsString(true));
    // Right way to get tokens
    String token = termAtt.toString();
    System.out.println("*** Token: " + token);
    // Hack out token - bad way!
    String tmp = ts.reflectAsString(false);
    String sameToken = (tmp.split(",")[0]).split("=")[1];
    System.out.println("*** Same token : " + sameToken);
    System.out.println("token start offset: " + offsetAtt.startOffset());
    System.out.println("token end offset: " + offsetAtt.endOffset());
  }
  ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
} finally {
  ts.close(); // Release resources associated with this stream.
  analyzer.close();
}
> Hi Dima,
>
> The example code you mentioned in your other recent email is pretty close.
>
> The only thing you'd probably want to add is access to the CharTermAttribute:
>
> CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>
> and then in the loop over ts.incrementToken(), you can get to the output tokens
> using termAtt.buffer() and termAtt.length(), or if you're going to Stringify
> tokens anyway, termAtt.toString().
>
> Steve
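The buffer()/length() pair mentioned in the quoted reply deserves a small illustration: the character buffer is reused between tokens and can be longer than the current token, so only the first length() characters are valid. The sketch below uses a plain char[] as a stand-in for termAtt.buffer() so that it runs without Lucene. The class name TermBufferDemo and the sample tokens are assumptions made for illustration.

```java
class TermBufferDemo {
    // Copies only the valid prefix of the buffer, the same idea as
    // new String(termAtt.buffer(), 0, termAtt.length()).
    static String tokenText(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        // Stand-in for a reused term buffer (hypothetical contents).
        char[] buffer = new char[8];

        // An earlier token "text" was written into the buffer ...
        "text".getChars(0, 4, buffer, 0);
        // ... and the current token "go" overwrote only the first two chars.
        "go".getChars(0, 2, buffer, 0);
        int length = 2;

        System.out.println(tokenText(buffer, length));         // prints "go"
        // Ignoring the length picks up stale characters from the earlier token.
        System.out.println(new String(buffer).startsWith("goxt")); // prints "true"
    }
}
```

If a String is needed anyway, termAtt.toString() is the simpler choice. Working with the buffer and length directly mainly helps when you want to avoid allocating a String per token.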