Posted to solr-user@lucene.apache.org by Felix Stanley <fe...@globalsources.com> on 2017/04/27 02:55:35 UTC

Indexing and querying Chinese in Solr 6.4.2

Hi, I have been facing an issue with indexing and querying a Chinese-character field using the CJK analyzer. Here is what I've done:

I defined a new field and field type in my schema.xml:

<field name="test_chinese" type="text_cjk" indexed="true" stored="true" multiValued="false"/>

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
  </analyzer>
</fieldType>

Then I indexed the following documents:

P_ProductId  P_SupplierId   test_chinese  P_CategoryName
1000366140   6008801195176  胡志明        Hồ Chí Minh
1000366141   6008801195177  胡志          Ho Chi
1000366142   6008801195178  明目          eyesight
1000366143   6008801195179  眼目          eyes
1000366144   6008801195180  杂志中的      magazines
1000366145   6008801195180  起明显        the aparent

Based on the query analysis in the Solr admin UI, I expected only 2 matches (胡志明 Hồ Chí Minh and 胡志 Ho Chi) when searching for 胡志明 (Ho Chi Minh), due to the n-gram nature of CJKBigramFilterFactory.
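To make that expectation concrete, here is a simplified Python sketch of overlapping-bigram matching over the six test_chinese values above. It is only an approximation and not Solr's implementation: the helper name cjk_bigrams is hypothetical, and the width-folding and lowercasing filters are ignored.

```python
def cjk_bigrams(text):
    """Split a CJK string into overlapping two-character tokens.

    Rough stand-in for what CJKBigramFilterFactory emits; a single
    character is returned as-is.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


# The test_chinese values of the six indexed documents.
docs = ["胡志明", "胡志", "明目", "眼目", "杂志中的", "起明显"]

# Bigrams of the query string.
query_terms = set(cjk_bigrams("胡志明"))

# A document matches if it shares at least one bigram with the query.
matches = [d for d in docs if query_terms & set(cjk_bigrams(d))]

print(sorted(query_terms))
print(matches)
```

Under these assumptions only 胡志明 and 胡志 share a bigram with the query, which is the 2-match result I expected.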

However, querying through the URL, e.g. http://localhost:8983/solr/product/select?q=胡志明 (percent-encoded: http://localhost:8983/solr/product/select?q=%E8%83%A1%E5%BF%97%E6%98%8E), gave me 5 matches:

1000366140   6008801195176  胡志明        Hồ Chí Minh
1000366141   6008801195177  胡志          Ho Chi
1000366142   6008801195178  明目          eyesight
1000366144   6008801195180  杂志中的      magazines
1000366145   6008801195180  起明显        the aparent

Any idea what has gone wrong here? Is there any special encoding that has to be done in the URL?
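For reference, this is a small Python sketch of how the query URL can be built with UTF-8 percent-encoding. The base URL is the one from my setup above. The second URL, which names the field explicitly as test_chinese:胡志明, is a hypothetical variant I have not verified; I include it only because the URL I used does not name a field.

```python
from urllib.parse import quote, urlencode

# Select handler URL from the post.
BASE = "http://localhost:8983/solr/product/select"

term = "胡志明"

# Percent-encode the term as UTF-8 bytes.
encoded = quote(term)
plain_url = f"{BASE}?q={encoded}"

# Hypothetical, untested variant: qualify the query with the field name.
# urlencode also escapes the colon as %3A.
fielded_url = f"{BASE}?{urlencode({'q': 'test_chinese:' + term})}"

print(plain_url)
print(fielded_url)
```

The first URL is the same percent-encoded form I used in my request, so the encoding itself appears to be standard UTF-8.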

This article might give a better idea. It clearly states that with CJK, searching for 胡志明 (Hồ Chí Minh) should return 2 results:

http://opensourceconnections.com/blog/2011/12/23/indexing-chinese-in-solr/

Thank you so much

 

