Posted to solr-user@lucene.apache.org by bing <JS...@hotmail.com> on 2012/02/14 07:34:55 UTC
Language specific tokenizer for purpose of multilingual search in
single-core solr,
Hi, all,
I want to do multilingual search in single-core Solr. That requires
defining language-specific tokenizers in schema.xml. Say, for example, I have
two tokenizers, one for English ("en") and one for Simplified Chinese
("zh-cn"). Can I just put the following definitions together in one schema.xml,
and both sets of files (stopwords, synonyms, and protwords) in one
directory?
1. fieldType and field definition for English ("en")
<fieldType name="text_en" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index" language="en">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_en.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
protected="protwords_en.txt"/>
</analyzer>
.....
</fieldType>
<field name="text_en" type="text_en" indexed="true" stored="false"
multiValued="true"/>
2. fieldType and field definition for Chinese ("zh_cn")
<fieldType name="text_zh_cn" class="solr.TextField"
positionIncrementGap="100">
<analyzer type="index" language="zh_cn">
<tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords_ch.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
protected="protwords_en.txt"/>
</analyzer>
.....
</fieldType>
<field name="text_zh_cn" type="text_zh_cn" indexed="true" stored="false"
multiValued="true"/>
Best
Bing
--
View this message in context: http://lucene.472066.n3.nabble.com/Language-specific-tokenizer-for-purpose-of-multilingual-search-in-single-core-solr-tp3742873p3742873.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Language specific tokenizer for purpose of multilingual search
in single-core solr,
Posted by Chris Hostetter <ho...@fucit.org>.
: I want to do multilingual search in single-core Solr. That requires
: defining language-specific tokenizers in schema.xml. Say, for example, I have
: two tokenizers, one for English ("en") and one for Simplified Chinese
: ("zh-cn"). Can I just put the following definitions together in one schema.xml,
: and both sets of files (stopwords, synonyms, and protwords) in one
: directory?
absolutely.
-Hoss
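For readers following along, here is a minimal sketch of what "both in one schema.xml" might look like in a Solr 3.x-style schema. It is not taken from the thread: the schema name and version are assumptions, the filter chains are trimmed for brevity, and a working schema would still need its other usual entries (unique key, id field, and so on). The file names and the IK tokenizer class are the ones from the original post. File names given in attributes such as words= and protected= are resolved against the core's conf directory, so both language sets can sit side by side there.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: schema name and version are assumed, not from the thread. -->
<schema name="multilingual-example" version="1.4">
  <types>
    <!-- English: whitespace tokenizer, English stopwords, English stemming -->
    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_en.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords_en.txt"/>
      </analyzer>
    </fieldType>

    <!-- Simplified Chinese: third-party IK tokenizer named in the original post -->
    <fieldType name="text_zh_cn" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords_ch.txt"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- One field per language; each field's type must match a fieldType name above -->
    <field name="text_en" type="text_en" indexed="true" stored="false"
           multiValued="true"/>
    <field name="text_zh_cn" type="text_zh_cn" indexed="true" stored="false"
           multiValued="true"/>
  </fields>
</schema>
```

The point of the sketch is that each field's type attribute has to match a declared fieldType name exactly, and that each language keeps its own resource files.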
Re: Language specific tokenizer for purpose of multilingual search in single-core solr,
Posted by Paul Libbrecht <pa...@hoplahup.net>.
Only one field element?
There should be two, shouldn't there?
One for each language.
paul