Posted to java-user@lucene.apache.org by Mindaugas Žakšauskas <mi...@gmail.com> on 2011/04/18 18:15:46 UTC
StandardTokenizer question
Hi,
Consider the following code, running under Lucene 3.0.1:
8<------------------------------------------------------------------------------
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
public class MyAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(
            true,
            new StandardTokenizer(Version.LUCENE_30, reader),
            StopAnalyzer.ENGLISH_STOP_WORDS_SET
        );
    }

    private static void printTokens(String string) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("default", new StringReader(string));
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.print(termAtt.term());
            System.out.print(" ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        printTokens("one_two_three");       // prints "one two three"
        printTokens("four4_five5_six6");    // prints "four4_five5_six6"
        printTokens("seven7_eight_nine");   // prints "seven7_eight nine"
        printTokens("ten_eleven11_twelve"); // prints "ten_eleven11_twelve"
    }
}
8<------------------------------------------------------------------------------
I can understand why "one_two_three" and "four4_five5_six6" are
tokenized as they are, as this is explained in the StandardTokenizer
class header Javadoc. But the other two cases are more subtle and I'm
not quite sure I get the idea.
If the "7" after "seven" makes it a joint token with "eight" but keeps
it separate from "nine", why is "ten" glued to "eleven11"?
Is there any standard and/or easy way to make StandardTokenizer always
split on the underscore?
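The only workaround I can think of so far is to replace each underscore
with a space before the text reaches the tokenizer. Below is a minimal
sketch of that idea using only java.io. The class name
UnderscoreToSpaceReader and its readAll helper are my own invention for
illustration, not part of Lucene, and I am assuming a one-for-one
character swap is acceptable because it leaves character offsets
unchanged.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not a Lucene class): a Reader wrapper that
// replaces every '_' with ' ' so a downstream tokenizer never sees it.
public class UnderscoreToSpaceReader extends FilterReader {

    public UnderscoreToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '_' ? ' ' : c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        // n is -1 at end of stream, in which case the loop does not run
        for (int i = off; i < off + n; i++) {
            if (cbuf[i] == '_') {
                cbuf[i] = ' ';
            }
        }
        return n;
    }

    // Illustration-only helper: drains the wrapped reader into a String.
    public static String readAll(String input) throws IOException {
        Reader r = new UnderscoreToSpaceReader(new StringReader(input));
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        int n;
        while ((n = r.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, n);
        }
        r.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll("seven7_eight_nine"));   // prints "seven7 eight nine"
        System.out.println(readAll("ten_eleven11_twelve")); // prints "ten eleven11 twelve"
    }
}
```

In MyAnalyzer this would presumably be used as
new StandardTokenizer(Version.LUCENE_30, new UnderscoreToSpaceReader(reader)),
but it feels like a hack, hence the question about a standard way.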
Thanks in advance.
m.