Posted to java-user@lucene.apache.org by joe_coder <co...@gmail.com> on 2009/08/17 09:24:37 UTC

Lucene Tokenizer + Merge terms

I am using a custom analyzer:


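    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Split the input into tokens, capping the token length
        StandardTokenizer tokenStream = new StandardTokenizer(reader);
        tokenStream.setMaxTokenLength(maxTokenLength);

        // Fold accented/non-ASCII characters to their ASCII equivalents
        TokenStream result = new ASCIIFoldingFilter(tokenStream);
        result = new StandardFilter(result);
        // Keep only tokens between 3 and maxTokenLength characters long
        result = new LengthFilter(result, 3, maxTokenLength);
        result = new LowerCaseFilter(result);
        // 'true' enables position increments for removed stop words
        result = new StopFilter(true, result, stopSet);
        result = new PorterStemFilter(result);
        return result;
    }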

My question is about creating a new tokenizer that can detect people's
names, place names, etc. (I can look such cases up in my local db).
E.g., if a text contains "Joe Coder is in New York", then instead of the
term vectors [Joe][Coder][New][York], I would like to get the term vectors
[Joe Coder][New York].
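
To make the desired merging concrete, here is a rough sketch of the lookup
step in plain Java, outside of Lucene. Everything in it is hypothetical: the
class name PhraseMerger, the knownPhrases set (a stand-in for my local db)
and the maxWords limit are made up for illustration, and it works on a plain
list of strings rather than on a TokenStream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: merges adjacent tokens
// that together form a known phrase (longest match first).
final class PhraseMerger {

    // knownPhrases: lower-cased, space-separated phrases (stand-in for the db lookup).
    // maxWords: the longest phrase, in words, that is tried at each position.
    static List<String> merge(List<String> tokens, Set<String> knownPhrases, int maxWords) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 1; // default: emit the single token unchanged
            // Try the longest candidate first so a longer phrase wins over its prefix.
            for (int len = Math.min(maxWords, tokens.size() - i); len >= 2; len--) {
                String candidate = join(tokens, i, i + len);
                if (knownPhrases.contains(candidate.toLowerCase(Locale.ROOT))) {
                    matched = len;
                    break;
                }
            }
            out.add(join(tokens, i, i + matched));
            i += matched;
        }
        return out;
    }

    // Joins tokens[from, to) with single spaces.
    private static String join(List<String> tokens, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int k = from; k < to; k++) {
            if (k > from) {
                sb.append(' ');
            }
            sb.append(tokens.get(k));
        }
        return sb.toString();
    }
}
```

With knownPhrases containing "joe coder" and "new york", calling merge on
the tokens Joe, Coder, is, in, New, York with maxWords = 3 gives
[Joe Coder][is][in][New York]; the stop words would still have to be removed
by a later step such as the StopFilter above.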

Is there any tokenizer in Lucene that I can extend to provide this
functionality? Any other pointers?

