Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2015/11/02 03:23:44 UTC

Re: Is it possible to use JiebaTokenizer for multilingual documents?

Here's my configuration in schema.xml for the JiebaTokenizerFactory.


<fieldType name="text_chinese2" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
               segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory"
               segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>


<field name="content" type="text_chinese2" indexed="true" stored="true"
       omitNorms="true" termVectors="true"/>
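For reference, the separate-field workaround I mentioned in my earlier message below would look roughly like this. This is only a sketch; the field names ("content_zh", "content_en") are placeholders, and it assumes the stock "text_general" field type from the default schema for the English side:

```xml
<!-- Sketch of the separate-field workaround (field names are placeholders). -->
<!-- Chinese-oriented analysis via Jieba on one field; the default English-
     friendly chain on the other, so English words are not mis-segmented. -->
<field name="content_zh" type="text_chinese2" indexed="true" stored="true"/>
<field name="content_en" type="text_general" indexed="true" stored="false"/>
<copyField source="content_zh" dest="content_en"/>
```

Queries would then need to search both fields (e.g. with edismax "qf=content_zh content_en"), which is exactly the extra complexity I was hoping to avoid.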


Could there be anything in this configuration that is causing the issue with
the English characters?

Regards,
Edwin


On 29 October 2015 at 17:51, Zheng Lin Edwin Yeo <ed...@gmail.com>
wrote:

> I would like to check, is it possible to use JiebaTokenizerFactory to
> index Multilingual documents in Solr?
>
> I found that JiebaTokenizerFactory works better for Chinese characters as
> compared to HMMChineseTokenizerFactory.
>
> However, for English characters, the JiebaTokenizerFactory is cutting the
> words at the wrong place. For example, it will cut the word "water" as
> follows:
> *w|at|er*
>
> This means that Solr will search for the 3 separate tokens "w", "at" and
> "er" instead of the whole word "water".
>
> Is there any way to solve this problem, besides using a separate field for
> English and Chinese characters?
>
> Regards,
> Edwin
>