Posted to solr-user@lucene.apache.org by go canal <go...@yahoo.com> on 2010/06/27 08:47:39 UTC
Chinese chars are not indexed ?
Hello,
I entered Chinese characters in the admin console to search for matching documents, but it does not return any, though I have uploaded some documents that have Chinese characters.
I guess the Chinese characters are not indexed. Is there any configuration I need to make in Solr?
rgds,
canal
Re: Chinese chars are not indexed ?
Posted by Andy <an...@yahoo.com>.
What if Chinese is mixed with English?
I have text that is entered by users and it could be a mix of Chinese, English, etc.
What's the best way to handle that?
Thanks.
--- On Mon, 6/28/10, Ahmet Arslan <io...@yahoo.com> wrote:
> From: Ahmet Arslan <io...@yahoo.com>
> Subject: Re: Chinese chars are not indexed ?
> To: solr-user@lucene.apache.org
> Date: Monday, June 28, 2010, 3:44 AM
> > oh yes, *...* works. thanks.
> >
> > I saw tokenizer is defined in schema.xml. There are a few
> > places that define the tokenizer. Wondering if it is enough
> > to define one for:
>
> It is better to define a brand new field type specific to
> Chinese.
>
> http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean
>
> Something like:
>
> at index time:
> <tokenizer class="solr.CJKTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
>
> at query time:
> <tokenizer class="solr.CJKTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.PositionFilterFactory" />
Re: Chinese chars are not indexed ?
Posted by Ahmet Arslan <io...@yahoo.com>.
> oh yes, *...* works. thanks.
>
> I saw tokenizer is defined in schema.xml. There are a few
> places that define the tokenizer. Wondering if it is enough
> to define one for:
It is better to define a brand new field type specific to Chinese.
http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean
Something like:
at index time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
at query time:
<tokenizer class="solr.CJKTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PositionFilterFactory" />
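Putting the pieces above together, a dedicated Chinese field type might look like the following sketch. The type name text_cjk and the field name content_zh are only illustrative, not anything from the thread:

```xml
<!-- Hypothetical field type combining the analyzers suggested above -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>

<!-- and a field declared with that type -->
<field name="content_zh" type="text_cjk" indexed="true" stored="true"/>
```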
Re: Chinese chars are not indexed ?
Posted by go canal <go...@yahoo.com>.
oh yes, *...* works. thanks.
I saw the tokenizer is defined in schema.xml. There are a few places that define the tokenizer. I am wondering if it is enough to define one for:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- -------- this is the only one I need to modify ? --------- -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- --------------------------------------------------------- -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
thanks,
canal
________________________________
From: Ahmet Arslan <io...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Mon, June 28, 2010 2:54:16 PM
Subject: Re: Chinese chars are not indexed ?
> I am using the sample, not deploying Solr in Tomcat. Is
> there a place I can modify this setting ?
Ah, okay, if you are using Jetty with java -jar start.jar then it is fine.
But for Chinese you need a special tokenizer, since Chinese is written without spaces between words.
<tokenizer class="solr.CJKTokenizerFactory"/>
Or you can search with both a leading and a trailing star: q=*ChineseText* should return something.
Re: Chinese chars are not indexed ?
Posted by Ahmet Arslan <io...@yahoo.com>.
> I am using the sample, not deploying Solr in Tomcat. Is
> there a place I can modify this setting ?
Ah, okay, if you are using Jetty with java -jar start.jar then it is fine.
But for Chinese you need a special tokenizer, since Chinese is written without spaces between words.
<tokenizer class="solr.CJKTokenizerFactory"/>
Or you can search with both a leading and a trailing star: q=*ChineseText* should return something.
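If that wildcard query is sent over HTTP directly (rather than through the admin console), the Chinese characters must be percent-encoded as UTF-8 bytes. A minimal sketch; the host, port, handler path, and the query text 中文 are assumptions for illustration only:

```python
from urllib.parse import urlencode

# Build a leading/trailing-wildcard query for some Chinese text.
# urlencode percent-encodes the UTF-8 bytes of each character.
params = {"q": "*中文*"}
query_string = urlencode(params)

# Hypothetical Solr endpoint, for illustration only.
url = "http://localhost:8983/solr/select?" + query_string
print(query_string)
```

If the characters arrive at the server double-encoded or as Latin-1 bytes instead, no documents will match even when the index is correct, which is what the URI Charset Config pointer later in the thread is about.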
Re: Chinese chars are not indexed ?
Posted by go canal <go...@yahoo.com>.
Hi,
I am using the sample, not deploying Solr in Tomcat. Is there a place where I can modify this setting?
thanks,
canal
________________________________
From: Ahmet Arslan <io...@yahoo.com>
To: solr-user@lucene.apache.org
Sent: Sun, June 27, 2010 4:47:15 PM
Subject: Re: Chinese chars are not indexed ?
> I enter Chinese chars in the admin console for searching
> matched documents, it does not return any though I have
> uploaded some documents that has Chinese chars.
Could it be URI Charset Config?
http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
Re: Chinese chars are not indexed ?
Posted by Ahmet Arslan <io...@yahoo.com>.
> I enter Chinese chars in the admin console for searching
> matched documents, it does not return any though I have
> uploaded some documents that has Chinese chars.
Could it be URI Charset Config?
http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
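For anyone who does deploy Solr under Tomcat, the wiki page above refers to the URIEncoding attribute of the HTTP connector in server.xml; a sketch, where the port and other attributes are only illustrative:

```xml
<!-- server.xml: tell Tomcat to decode query-string bytes as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>
```

Without this, Tomcat decodes query parameters as ISO-8859-1, so Chinese search terms reach Solr as garbage and match nothing even though the documents are indexed.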