Posted to solr-user@lucene.apache.org by tiffany <ti...@future.co.jp> on 2011/06/03 09:40:24 UTC

How to search camel case words using CJKTokenizer

Hi all,

I'm using the CJKTokenizerFactory tokenizer to handle text that contains both
Japanese and alphabetic (Latin) words.  However, I noticed that
CJKTokenizerFactory converts alphabetic characters to lowercase, so I cannot
use the WordDelimiterFilterFactory filter with the splitOnCaseChange property
for camel case words.
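
To illustrate the problem described above, here is a minimal Python sketch (not
Solr code). The function split_on_case_change is a hypothetical stand-in for
what a split-on-case-change step does; it only shows that once a token has been
lowercased, there is no case transition left to split on.

```python
def split_on_case_change(token):
    """Split a token at every lower-to-upper case transition.

    Hypothetical helper for illustration only; it is not the
    WordDelimiterFilterFactory implementation.
    """
    parts, start = [], 0
    for i in range(1, len(token)):
        if token[i - 1].islower() and token[i].isupper():
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


original = "camelCaseWord"

# Case information intact: the token can be split into its parts.
print(split_on_case_change(original))

# Lowercased first (as the tokenizer is reported to do): nothing to split on.
print(split_on_case_change(original.lower()))
```

The order of the two steps is what matters here: case-based splitting has to
see the token before any lowercasing happens.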

I switched to NGramTokenizerFactory (2-gram), but it only parses the first
1024 characters, so I cannot use NGramTokenizerFactory either.
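
The effect of such a read limit can be sketched in Python as follows. This is
only a simulation of the behavior reported above, not the tokenizer's actual
code; the function name bigrams_with_read_limit and its limit parameter are
made up for the illustration.

```python
def bigrams_with_read_limit(text, limit=1024):
    """Return character 2-grams built from the first `limit` characters only.

    Hypothetical simulation: anything after `limit` characters is never
    seen, so it produces no grams and cannot be matched by a search.
    """
    window = text[:limit]
    return [window[i:i + 2] for i in range(len(window) - 1)]


text = "ab" * 1000  # 2000 characters

limited = bigrams_with_read_limit(text)
full = bigrams_with_read_limit(text, limit=len(text))

print(len(limited))  # grams from the first 1024 characters only
print(len(full))     # grams from the whole text
```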

I tried the following two settings, and both seem to work fine, but I don't
know whether they are good choices or whether there are better solutions.

1)
        <tokenizer class="solr.CJKTokenizerFactory" />
        <filter class="solr.NGramFilterFactory" maxGramSize="2" minGramSize="2" />

2)
        <tokenizer class="solr.StandardTokenizerFactory" />
        <filter class="solr.NGramFilterFactory" maxGramSize="1" minGramSize="1" />
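
As a rough way to compare the two settings, the Python sketch below applies a
per-token n-gram step to token lists. It is an approximation under stated
assumptions: ngram_filter is a hypothetical helper, and the input tokens are
hand-written examples, not the real output of CJKTokenizerFactory or
StandardTokenizerFactory.

```python
def ngram_filter(tokens, min_gram, max_gram):
    """Expand each token into its character n-grams.

    Hypothetical helper for illustration. In this sketch, a token shorter
    than min_gram yields no grams at all.
    """
    out = []
    for tok in tokens:
        for n in range(min_gram, max_gram + 1):
            out.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return out


# Setting 1: assumed already-lowercased token, then 2-grams.
print(ngram_filter(["solr"], 2, 2))

# Setting 2: assumed case-preserving token, then 1-grams.
print(ngram_filter(["Solr"], 1, 1))

# A one-character token with 2-grams produces nothing in this sketch.
print(ngram_filter(["a"], 2, 2))
```

With 1-grams every query reduces to single characters, so matching relies on
character positions, and the index is likely to grow; with 2-grams, very short
tokens may not be searchable. Both points are worth checking against the real
analyzers on the Solr analysis page.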

Any advice would be appreciated.

Thank you.

Tiffany
