Posted to infrastructure-issues@apache.org by "Srimanth Bangalore Krishnamurthy (JIRA)" <ji...@apache.org> on 2015/10/12 15:14:05 UTC

[jira] [Created] (INFRA-10577) Tokenizing Chinese strings using the Lucene Chinese analyzer

Srimanth Bangalore Krishnamurthy created INFRA-10577:
--------------------------------------------------------

             Summary: Tokenizing Chinese strings using the Lucene Chinese analyzer
                 Key: INFRA-10577
                 URL: https://issues.apache.org/jira/browse/INFRA-10577
             Project: Infrastructure
          Issue Type: Bug
            Reporter: Srimanth Bangalore Krishnamurthy


The text that is indexed: 校准的卡尔曼滤波器 ("the calibrated Kalman filter")
Query string: 卡尔曼滤波 ("Kalman filtering")

The exact query string appears verbatim in an indexed document in Solr, but the query does not return that document.

The Solr analysis screen shows the indexed terms as:
的卡 
尔 
曼 
滤波器 

but the query string is analyzed as:
卡 
尔 
曼 
滤波 

The surrounding characters appear to influence how 卡尔曼滤波 is segmented, so the indexed terms and the query terms never line up.
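
For what it's worth, the mismatch can be reproduced outside Solr with the analyzer alone. Below is a minimal sketch, assuming text_zh is backed by Lucene's smartcn module (SmartChineseAnalyzer, the same HMM-based segmenter behind solr.HMMChineseTokenizerFactory) and a Lucene 5.x API; the field name "content" is only a placeholder:

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DumpTokens {

    // Print each term the analyzer produces for the given text.
    static void dump(Analyzer analyzer, String text) throws IOException {
        try (TokenStream ts = analyzer.tokenStream("content", text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new SmartChineseAnalyzer();
        System.out.println("indexed text:");
        dump(analyzer, "校准的卡尔曼滤波器");  // segmented with the surrounding context
        System.out.println("query string:");
        dump(analyzer, "卡尔曼滤波");          // same characters, segmented differently
    }
}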

Is this expected behavior?

Here are the things I have tried:
1) I tried a couple of different tokenizers, and the behavior was the same.

2) I explored the option of supplying a custom dictionary, but found this:
https://issues.apache.org/jira/browse/LUCENE-1817

3) I tried the following with text_zh for Chinese documents:
a) solr.KeywordMarkerFilterFactory
b) solr.StemmerOverrideFilterFactory
c) adding entries to synonyms.txt
All of these seem to work only with text_en and have no effect for text_zh, presumably because they operate on tokens after tokenization, while the problem here is the segmentation itself.

Are there any options I can try to make sure that the query returns this document?
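
One direction I have not tried yet is dropping dictionary-based segmentation for this field and indexing overlapping CJK bigrams instead. With bigrams, the terms produced for a string do not depend on the surrounding text, so any contiguous sub-phrase of an indexed document will match, at the cost of some precision. A minimal sketch of such an analyzer in Lucene (again assuming a Lucene 5.x API):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class BigramSketch {

    // StandardTokenizer emits one token per Han character; CJKBigramFilter
    // then joins adjacent characters into overlapping bigrams, so the same
    // terms come out regardless of what surrounds them.
    static Analyzer bigramAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                StandardTokenizer source = new StandardTokenizer();
                return new TokenStreamComponents(source, new CJKBigramFilter(source));
            }
        };
    }

    public static void main(String[] args) throws IOException {
        // 校准的卡尔曼滤波器 -> 校准 准的 的卡 卡尔 尔曼 曼滤 滤波 波器
        // 卡尔曼滤波       -> 卡尔 尔曼 曼滤 滤波  (a subset, so the query matches)
        try (TokenStream ts = bigramAnalyzer().tokenStream("content", "校准的卡尔曼滤波器")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString());
            }
            ts.end();
        }
    }
}

In schema.xml this would correspond to a fieldType using solr.StandardTokenizerFactory followed by solr.CJKBigramFilterFactory on both the index and query side. Would that be a reasonable way to guarantee that this query returns the document, or is there a smartcn-based option I am missing?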



