Posted to java-user@lucene.apache.org by Mindaugas Žakšauskas <mi...@gmail.com> on 2011/04/18 18:15:46 UTC

StandardTokenizer question

Hi,

The following code is running under Lucene 3.0.1:

8<------------------------------------------------------------------------------
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class MyAnalyzer extends Analyzer {

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return
                new StopFilter(
                        true,
                        new StandardTokenizer(Version.LUCENE_30, reader),
                        StopAnalyzer.ENGLISH_STOP_WORDS_SET
                );
    }

    private static void printTokens(String string) throws IOException {
        TokenStream ts = new MyAnalyzer().tokenStream("default",
                new StringReader(string));
        TermAttribute termAtt = ts.getAttribute(TermAttribute.class);
        while(ts.incrementToken()) {
            System.out.print(termAtt.term());
            System.out.print(" ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws IOException {
        printTokens("one_two_three");           // prints "one two three"
        printTokens("four4_five5_six6");        // prints "four4_five5_six6"
        printTokens("seven7_eight_nine");       // prints "seven7_eight nine"
        printTokens("ten_eleven11_twelve");     // prints "ten_eleven11_twelve"
    }
}

8<------------------------------------------------------------------------------

I can understand why "one_two_three" and "four4_five5_six6" are
tokenized as they are, as this is explained in the StandardTokenizer
class header Javadoc. But the other two cases are more subtle and I'm
not quite sure I get the idea.

If the "7" after "seven" causes it to be joined into one token with
"eight" but kept separate from "nine", why is "ten" glued to "eleven11"?

Is there any standard and/or easy way to make StandardTokenizer always
split on the underscore?
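
One workaround I can think of, as a rough sketch only, is to replace
every underscore with a space before the text reaches the tokenizer.
The class below uses only java.io; the names UnderscoreToSpaceReader
and readAll are hypothetical, made up for this illustration. Because
one character is swapped for exactly one other character, token
offsets should still line up with the original text.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper (not part of Lucene): a Reader wrapper that
// turns every '_' into ' ' as the characters are read.
class UnderscoreToSpaceReader extends FilterReader {

    UnderscoreToSpaceReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == '_' ? ' ' : c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        // n is -1 at end of stream, in which case the loop does not run
        for (int i = off; i < off + n; i++) {
            if (cbuf[i] == '_') {
                cbuf[i] = ' ';
            }
        }
        return n;
    }

    // Demo helper: drains the wrapped reader into a String.
    static String readAll(String input) throws IOException {
        Reader r = new UnderscoreToSpaceReader(new StringReader(input));
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}
```

In MyAnalyzer.tokenStream() the tokenizer would then presumably be
built as new StandardTokenizer(Version.LUCENE_30, new
UnderscoreToSpaceReader(reader)). I have not verified this against
3.0.1. Lucene's own MappingCharFilter with a NormalizeCharMap might be
the more standard route to the same effect, but I am not sure.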

Thanks in advance.

m.
