Posted to solr-user@lucene.apache.org by Mohammad Shariq <sh...@gmail.com> on 2011/06/08 14:34:55 UTC

How to index and search non-English text in Solr

Hi,
I have set up Solr (solr-1.4 on Ubuntu 10.10) to index news articles in
English, but my requirements now extend to indexing news in other languages too.

This is how the field is declared in my schema:
<field name="news" type="text" indexed="true" stored="false" required="false"/>

And the "text" fieldType in schema.xml looks like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="1" catenateNumbers="1"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
                generateNumberParts="1" catenateWords="0" catenateNumbers="0"
                catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
    </analyzer>
</fieldType>


My problem is:
I now want to index news articles in other languages too, e.g.
Chinese and Japanese.
How can I modify my text field so that I can index news in other languages
and make it searchable?

Thanks
Shariq





--
View this message in context: http://lucene.472066.n3.nabble.com/how-to-Index-and-Search-non-Eglish-Text-in-solr-tp3038851p3038851.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to index and search non-English text in Solr

Posted by Erick Erickson <er...@gmail.com>.

Well, no. Specifying both indexed and stored as "false"
is essentially a no-op: nothing is indexed and nothing is
stored, so you'd never find anything!

But even with indexed="true", this solution has problems.
It essentially uses a single field to hold text from
different languages. The problem is that tokenization,
stemming, etc. behave differently in different
languages, especially when you contrast CJK
and Western languages.
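
To make the first point concrete, here is a minimal sketch of the catch-all
field from the question with indexing switched on. The names defaultquery,
query_text and news_* are taken from the question; the definition of
query_text is not shown in this thread, so its analyzer is an unknown here.

```xml
<!-- Sketch only: field names come from the question; "query_text" is the
     asker's fieldType, whose definition is not shown in this thread. -->

<!-- indexed="true" makes the field searchable; stored="false" is fine for
     a search-only catch-all field. -->
<field name="defaultquery" type="query_text" indexed="true" stored="false"
       multiValued="true"/>

<copyField source="news_English" dest="defaultquery"/>
<copyField source="news_Chinese" dest="defaultquery"/>
<copyField source="news_Japnese" dest="defaultquery"/>
```

copyField passes the original, unanalyzed text to the destination field, so
everything that lands in defaultquery goes through the one analyzer chain of
query_text, whatever language it is in. That is the single-field problem
described above.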

Best
Erick

On Fri, Jun 10, 2011 at 1:05 AM, Mohammad Shariq <sh...@gmail.com> wrote:
> Thanks Erick for your help.
> I have another silly question.
> Suppose I create multiple fieldTypes, e.g. news_English, news_Chinese,
> news_Japnese, etc.
> After creating these fields, can I copy all of them to a copyField
> destination, "defaultquery", like below?
>
> <copyField source="news_English" dest="defaultquery"/>
> <copyField source="news_Chinese" dest="defaultquery"/>
> <copyField source="news_Japnese" dest="defaultquery"/>
>
> And my "defaultquery" field looks like this:
> <field name="defaultquery" type="query_text" indexed="false" stored="false"
>        multiValued="true"/>
>
> Is this the right way to deal with multi-language indexing and searching?

Re: How to index and search non-English text in Solr

Posted by Mohammad Shariq <sh...@gmail.com>.

Thanks Erick for your help.
I have another silly question.
Suppose I create multiple fieldTypes, e.g. news_English, news_Chinese,
news_Japnese, etc.
After creating these fields, can I copy all of them to a copyField
destination, "defaultquery", like below?

<copyField source="news_English" dest="defaultquery"/>
<copyField source="news_Chinese" dest="defaultquery"/>
<copyField source="news_Japnese" dest="defaultquery"/>

And my "defaultquery" field looks like this:
<field name="defaultquery" type="query_text" indexed="false" stored="false"
       multiValued="true"/>

Is this the right way to deal with multi-language indexing and searching?


On 9 June 2011 19:06, Erick Erickson <er...@gmail.com> wrote:

> No, you'd have to create multiple fieldTypes, one for each language....
>
> Best
> Erick



-- 
Thanks and Regards
Mohammad Shariq

Re: How to index and search non-English text in Solr

Posted by Erick Erickson <er...@gmail.com>.

No, you'd have to create multiple fieldTypes, one for each language....
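
A rough sketch of what "one fieldType per language" could look like for two of
the languages in the question. The names text_nl, text_hu, news_nl and news_hu
are made up for illustration, and the filter chain is deliberately short. Each
analyzer takes exactly one tokenizer, followed by its filters, with a single
stemmer at the end.

```xml
<!-- Hypothetical names: text_nl, text_hu, news_nl, news_hu. -->
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
  <!-- An analyzer without a type attribute is used for both index and query. -->
  <analyzer>
    <!-- exactly one tokenizer per analyzer, then the filter chain -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>

<fieldType name="text_hu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>
  </analyzer>
</fieldType>

<!-- One field per language, each bound to its own fieldType. -->
<field name="news_nl" type="text_nl" indexed="true" stored="false"/>
<field name="news_hu" type="text_hu" indexed="true" stored="false"/>
```

As far as I know Snowball has no Chinese stemmer, so Chinese or Japanese text
would get a fieldType of its own built around a CJK-aware tokenizer (such as
the solr.CJKTokenizerFactory mentioned in the question) rather than a Snowball
filter.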

Best
Erick

On Thu, Jun 9, 2011 at 5:26 AM, Mohammad Shariq <sh...@gmail.com> wrote:
> Can I specify multiple languages in the filter tags in schema.xml, like below?
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>              words="stopwords.txt" enablePositionIncrements="true"/>
>      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>              catenateAll="0" splitOnCaseChange="1"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <tokenizer class="solr.CJKTokenizerFactory"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>

Re: How to index and search non-English text in Solr

Posted by Mohammad Shariq <sh...@gmail.com>.

Can I specify multiple languages in the filter tags in schema.xml, like below?

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt" enablePositionIncrements="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
              catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Chinese"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <tokenizer class="solr.CJKTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Hungarian"/>


On 8 June 2011 18:47, Erick Erickson <er...@gmail.com> wrote:

> This page is a handy reference for individual languages...
> http://wiki.apache.org/solr/LanguageAnalysis
>
> But the usual approach, especially for Chinese/Japanese/Korean
> (CJK) is to index the content in different fields with language-specific
> analyzers then spread your search across the language-specific
> fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
> particularly give "surprising" results if you put words from different
> languages in the same field.
>
> Best
> Erick



-- 
Thanks and Regards
Mohammad Shariq

Re: How to index and search non-English text in Solr

Posted by Erick Erickson <er...@gmail.com>.

This page is a handy reference for individual languages...
http://wiki.apache.org/solr/LanguageAnalysis

But the usual approach, especially for Chinese/Japanese/Korean
(CJK) is to index the content in different fields with language-specific
analyzers then spread your search across the language-specific
fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
particularly give "surprising" results if you put words from different
languages in the same field.
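
To illustrate, a minimal sketch of that per-language layout for the news
field in the question. The names text_cjk, news_en, news_zh and news_ja are
hypothetical; "text" is the existing English fieldType from the question, and
solr.CJKTokenizerFactory (which should be available in Solr 1.4) breaks CJK
text into overlapping two-character tokens instead of relying on whitespace.

```xml
<!-- Hypothetical names: text_cjk, news_en, news_zh, news_ja. -->

<!-- CJK text has no whitespace between words, so it gets its own
     tokenizer and no English stemming or stopwords. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- English keeps the existing "text" type; Chinese and Japanese share
     the CJK type but live in separate fields. -->
<field name="news_en" type="text"     indexed="true" stored="false"/>
<field name="news_zh" type="text_cjk" indexed="true" stored="false"/>
<field name="news_ja" type="text_cjk" indexed="true" stored="false"/>
```

The indexing client then has to put each article into the field that matches
its language. One way to spread a search across the fields is the dismax
query parser with a qf list, for example:

    q=election&defType=dismax&qf=news_en+news_zh+news_ja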

Best
Erick
