Posted to infrastructure-issues@apache.org by "Srimanth Bangalore Krishnamurthy (JIRA)" <ji...@apache.org> on 2015/10/12 15:14:05 UTC
[jira] [Created] (INFRA-10577) Tokenizing Chinese strings using Lucene Chinese analyzer
Srimanth Bangalore Krishnamurthy created INFRA-10577:
--------------------------------------------------------
Summary: Tokenizing Chinese strings using Lucene Chinese analyzer
Key: INFRA-10577
URL: https://issues.apache.org/jira/browse/INFRA-10577
Project: Infrastructure
Issue Type: Bug
Reporter: Srimanth Bangalore Krishnamurthy
The text that is indexed: 校准的卡尔曼滤波器 ("calibrated Kalman filter")
Query string: 卡尔曼滤波 ("Kalman filter")
The exact query string appears verbatim in an indexed document in Solr, but the query does not return that document. (Matching is term-based, so a verbatim substring only matches if both sides tokenize it the same way.)
The Solr Analysis page shows the index-time tokens:
的卡
尔
曼
滤波器
but the query-time terms are:
卡
尔
曼
滤波
The surrounding characters appear to influence how 卡尔曼滤波 is tokenized.
Is this expected behavior?
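For reference, here is a minimal sketch to reproduce the segmentation outside Solr, assuming text_zh is backed by Lucene's SmartChineseAnalyzer (the HMM-based analyzer behind solr.HMMChineseTokenizerFactory); the class name SegmentationRepro and the field name "content" are illustrative only:

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SegmentationRepro {

        // Print every token the analyzer emits for the given text.
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("content", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term.toString() + "]");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            Analyzer smartcn = new SmartChineseAnalyzer();
            printTokens(smartcn, "校准的卡尔曼滤波器"); // the indexed text
            printTokens(smartcn, "卡尔曼滤波");        // the query string
        }
    }

Because the HMM segmenter is context-sensitive, the same five characters can be split differently with and without the surrounding text, so the index terms and the query terms never line up, which matches what the Analysis page shows.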
Here is what I have tried:
1) I tried a couple of different tokenizers, and the behavior is the same.
2) I explored the option of a custom dictionary, but I found this:
https://issues.apache.org/jira/browse/LUCENE-1817
3) I tried the following with text_zh for Chinese documents:
a) solr.KeywordMarkerFilterFactory
b) solr.StemmerOverrideFilterFactory
c) Adding to synonyms.txt
All of these seem to work only with text_en and have no effect on text_zh (which makes sense: these filters and synonyms are applied to tokens after tokenization, so they cannot change how the text is split in the first place).
Are there any options I can try to make sure that the query returns this document?
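One option that may be worth testing: index CJK text as overlapping bigrams instead of dictionary/HMM words, e.g. with Lucene's CJKAnalyzer (StandardTokenizer plus CJKBigramFilter, which Solr exposes as solr.CJKBigramFilterFactory). Bigrams are context-free, so any query that occurs verbatim in a document shares all of its bigrams with that document. A minimal sketch (the class name BigramSketch and the field name "content" are illustrative only):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class BigramSketch {

        // Print every token the analyzer emits for the given text.
        static void printTokens(Analyzer analyzer, String text) throws IOException {
            try (TokenStream ts = analyzer.tokenStream("content", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.print("[" + term.toString() + "]");
                }
                ts.end();
                System.out.println();
            }
        }

        public static void main(String[] args) throws IOException {
            Analyzer cjk = new CJKAnalyzer();
            // Indexed text -> [校准][准的][的卡][卡尔][尔曼][曼滤][滤波][波器]
            printTokens(cjk, "校准的卡尔曼滤波器");
            // Query -> [卡尔][尔曼][曼滤][滤波]: every query bigram occurs
            // above, so a phrase query for it matches the document.
            printTokens(cjk, "卡尔曼滤波");
        }
    }

The trade-off is precision (bigrams match more loosely than dictionary words), but recall for verbatim substrings is guaranteed. The equivalent Solr field type would chain solr.StandardTokenizerFactory with solr.CJKBigramFilterFactory.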
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)