Posted to solr-user@lucene.apache.org by go canal <go...@yahoo.com> on 2010/06/27 08:47:39 UTC

Chinese chars are not indexed ?

Hello,
When I enter Chinese characters in the admin console to search for matching documents, nothing is returned, even though I have uploaded some documents that contain Chinese characters.

I guess the Chinese characters are not indexed. Is there any configuration I need to make in Solr?
 rgds,
canal

Re: Chinese chars are not indexed ?

Posted by Andy <an...@yahoo.com>.

What if Chinese is mixed with English?

I have text that is entered by users and it could be a mix of Chinese, English, etc.

What's the best way to handle that?

Thanks.


Re: Chinese chars are not indexed ?

Posted by Ahmet Arslan <io...@yahoo.com>.

> oh yes, *...* works. thanks.
> 
> I saw tokenizer is defined in schema.xml. There are a few
> places that define the tokenizer. Wondering if it is enough
> to define one for:

It is better to define a brand new field type specific to Chinese. 

http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean

Something like:

at index time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>

at query time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PositionFilterFactory" />
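
A minimal sketch of how those two analyzer chains could sit in schema.xml as a dedicated field type. The type name text_cjk and the field name content_cjk are hypothetical, chosen only for illustration. The tokenizer and filter factories are the ones named above, and whether they are available depends on the Solr version in use.

```xml
<!-- Hypothetical field type dedicated to Chinese/Japanese/Korean text -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: CJK tokenizer plus lowercasing -->
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: same chain, followed by the position filter -->
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the new type -->
<field name="content_cjk" type="text_cjk" indexed="true" stored="true"/>
```

Documents indexed before the schema change would need to be re-indexed for the new analysis to take effect.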

Re: Chinese chars are not indexed ?

Posted by go canal <go...@yahoo.com>.

oh yes, *...* works. thanks.

I saw that the tokenizer is defined in schema.xml, and that there are a few places that define one. I am wondering whether it is enough to change the one marked below:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
       <!--  --------  this is the only one I need to modify ? --------- -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
       <!-- --------------------------------------------------------- -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
      </analyzer>
    </fieldType>

 thanks,
canal





Re: Chinese chars are not indexed ?

Posted by Ahmet Arslan <io...@yahoo.com>.

> I am using the sample, not deploying Solr in Tomcat. Is
> there a place I can modify this setting ?


Ah, OK. If you are running Jetty with java -jar start.jar, then that setting is fine.
But for Chinese you need a special tokenizer, since Chinese is written without spaces between words.

<tokenizer class="solr.CJKTokenizerFactory"/>


Or you can search with both a leading and a trailing star: q=*ChineseText* should return something.

Re: Chinese chars are not indexed ?

Posted by go canal <go...@yahoo.com>.

Hi,
I am using the sample and am not deploying Solr in Tomcat. Is there a place where I can modify this setting?
 thanks,
canal





Re: Chinese chars are not indexed ?

Posted by Ahmet Arslan <io...@yahoo.com>.

> I enter Chinese chars in the admin console for searching
> matched documents, it does not return any though I have
> uploaded some documents that has Chinese chars. 


Could it be the URI Charset Config?
http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
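
For anyone who does deploy under Tomcat, the setting that wiki section refers to is the URIEncoding attribute on the HTTP connector in Tomcat's conf/server.xml. A minimal sketch, assuming a default connector. The port, timeout, and redirect port are illustrative values and do not come from this thread.

```xml
<!-- conf/server.xml: decode request URIs (including query strings) as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

Tomcat has to be restarted for the change to take effect.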