Posted to solr-user@lucene.apache.org by tiffany <ti...@future.co.jp> on 2011/06/03 09:40:24 UTC
How to search camel case words using CJKTokenizer
Hi all,
I'm using CJKTokenizerFactory to handle text that contains both
Japanese and alphabetic (Latin) words. However, I noticed that
CJKTokenizerFactory converts alphabetic characters to lowercase, so I
cannot use WordDelimiterFilterFactory with the splitOnCaseChange property
for camel case words.
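To illustrate the problem, a chain along these lines is what does not split camel case words. This is only a sketch: the field type name text_cjk_wdf and the generateWordParts attribute are illustrative assumptions, not copied from the actual schema.

```xml
<!-- Sketch only: "text_cjk_wdf" is a made-up name for illustration. -->
<fieldType name="text_cjk_wdf" class="solr.TextField">
  <analyzer>
    <!-- The tokenizer already emits Latin words in lowercase... -->
    <tokenizer class="solr.CJKTokenizerFactory" />
    <!-- ...so by the time this filter runs there is no case change
         left in the token, and "camelCase" is not split. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="1" />
  </analyzer>
</fieldType>
```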
I switched to NGramTokenizerFactory (2-gram), but it only processes the
first 1024 characters of the input, so I cannot use NGramTokenizerFactory
either.
I tried the following two settings and both seem to work, but I don't
know whether they are good choices or whether there are better
solutions.
1)
<tokenizer class="solr.CJKTokenizerFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2" />
2)
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="1" />
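For context, setting 1 would sit inside a field type in schema.xml roughly as follows. This is a sketch under assumptions: the field type name text_cjk_2gram and the positionIncrementGap value are placeholders, not taken from a tested configuration. Setting 2 would have the same shape with the tokenizer and gram sizes swapped.

```xml
<!-- Sketch only: name and positionIncrementGap are placeholder values. -->
<fieldType name="text_cjk_2gram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizes the text; alphabetic words come out in lowercase. -->
    <tokenizer class="solr.CJKTokenizerFactory" />
    <!-- Breaks each token from the tokenizer into 2-character grams. -->
    <filter class="solr.NGramFilterFactory"
            minGramSize="2" maxGramSize="2" />
  </analyzer>
</fieldType>
```

Note that the tokens are still lowercase here, so this approach matches parts of a camel case word by their 2-grams rather than by splitting on the case change.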
If anyone can give me any advice, I would appreciate it.
Thank you.
Tiffany
--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-search-camel-case-words-using-CJKTokenizer-tp3018853p3018853.html
Sent from the Solr - User mailing list archive at Nabble.com.