Posted to solr-user@lucene.apache.org by Daniel Persson <ma...@gmail.com> on 2010/01/11 13:45:56 UTC

Multi language support

Hi Solr users.

I'm trying to set up a site with Solr search integrated, and I use the
SolrJ API to feed the index with search documents. At the moment I
have only activated search on the English portion of the site. I'm
interested in using as many features of Solr as possible. Synonyms,
stopwords and stemming all sound quite interesting and useful, but how do
I set this up in a good way for a multilingual site?

The site doesn't have a huge mass of text, so performance issues don't
really bother me, but I'd still like to hear your suggestions before I
try to implement a solution.

Best regards

Daniel

Re: Multi language support

Posted by Robert Muir <rc...@gmail.com>.
I don't think dropping stopwords is something to consider across the board
for all languages. The same grammatical units that are part of a word in one
language (and removed by stemmers) are independent morphemes in others
(and should be stopwords).

So please take this advice on a case-by-case basis for each language.

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog <go...@gmail.com> wrote:
> There are a lot of projects that don't use stopwords any more. You
> might consider dropping them altogether.



-- 
Robert Muir
rcmuir@gmail.com

Re: Multi language support

Posted by Robert Muir <rc...@gmail.com>.
Right, but we should not encourage users to significantly degrade
overall relevance for all movies because of a few movies and a band (very
special cases, as I said).

In English, not using stopwords doesn't really degrade relevance that
much, so it's a reasonable decision to make. This is not true in other
languages!

Instead, systems that worry about all-stopword queries should use
CommonGrams. It will work better for these cases, without taking away
from overall relevance.
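
A minimal sketch of what a CommonGrams setup might look like in schema.xml. The field type name en_text_cg and the file name commonwords.en.txt are made up for illustration, and the tokenizer and lower-casing filter are assumptions rather than anything stated in this thread. solr.CommonGramsFilterFactory and solr.CommonGramsQueryFilterFactory are the index-time and query-time factories that ship with Solr.

```xml
<fieldType name="en_text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- keeps every single token and also emits bigrams that contain a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.en.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- query side: prefers the bigrams, so phrase queries avoid the very frequent single terms -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.en.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

With "the" listed in commonwords.en.txt, a title such as "The The" is indexed as the single tokens plus the bigram the_the, so a phrase query for it still has something to match even though no StopFilter is involved.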

On Wed, Jan 13, 2010 at 1:08 AM, Walter Underwood <wu...@wunderwood.org> wrote:
> There is a band named "The The". And a producer named "Don Was". For a list of all-stopword movie titles at Netflix, see this post:
>
> http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
>
> My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in two languages. And a very good movie.
>
> wunder



-- 
Robert Muir
rcmuir@gmail.com

Re: Multi language support

Posted by Lance Norskog <go...@gmail.com>.
Robert Muir: Thank you for the pointer to that paper!

On Wed, Jan 13, 2010 at 6:29 AM, Paul Libbrecht <pa...@activemath.org> wrote:
> Isn't the conclusion here that matching free of stopword removal and
> stemming should rank as the best match, if there is one, and that matching
> should then gently degrade to weaker forms?
>
> paul



-- 
Lance Norskog
goksron@gmail.com

Re: Multi language support

Posted by Paul Libbrecht <pa...@activemath.org>.
Isn't the conclusion here that matching free of stopword removal and
stemming should rank as the best match, if there is one, and that matching
should then gently degrade to weaker forms?
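
One way this idea is sometimes approximated in Solr, sketched under assumptions: index the same text into an unstemmed field that keeps its stopwords and into a stemmed field, then weight the exact field higher with dismax. All field, type and handler names below are hypothetical, and en_text_exact is assumed to be a field type with only a tokenizer and a lower-casing filter. Nobody in this thread posted such a setup.

```xml
<!-- schema.xml: an exact field next to the analyzed one -->
<field name="text_en"       type="en_text"       indexed="true" stored="true"/>
<field name="text_en_exact" type="en_text_exact" indexed="true" stored="false"/>
<copyField source="text_en" dest="text_en_exact"/>

<!-- solrconfig.xml: exact matches score higher, stemmed matches still count -->
<requestHandler name="/search_en" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text_en_exact^3.0 text_en</str>
  </lst>
</requestHandler>
```

A document that matches the query literally then hits both fields and ranks first, while one that matches only after stemming still appears further down.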

paul


On 13 Jan 2010, at 07:08, Walter Underwood wrote:

> There is a band named "The The". And a producer named "Don Was". For  
> a list of all-stopword movie titles at Netflix, see this post:
>
> http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
>
> My favorite is "To Be and To Have (Être et Avoir)", which is all  
> stopwords in two languages. And a very good movie.
>
> wunder


Re: Multi language support

Posted by Walter Underwood <wu...@wunderwood.org>.
There is a band named "The The". And a producer named "Don Was". For a list of all-stopword movie titles at Netflix, see this post:

http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html

My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in two languages. And a very good movie.

wunder

On Jan 12, 2010, at 6:55 PM, Robert Muir wrote:

> sorry, i forgot to include this 2009 paper comparing what stopwords do
> across 3 languages:
> 
> http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
> 
> in my opinion, if stopwords annoy your users for very special cases
> like 'the the' then, instead consider using commongrams +
> defaultsimilarity.discountOverlaps = true so that you still get the
> benefits.
> 
> as you can see from the above paper, they can be extremely important
> depending on the language, they just don't matter so much for English.


Re: Multi language support

Posted by Robert Muir <rc...@gmail.com>.
Sorry, I forgot to include this 2009 paper comparing what stopwords do
across 3 languages:

http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

In my opinion, if stopwords annoy your users in very special cases
like 'the the', then instead consider using CommonGrams +
DefaultSimilarity.discountOverlaps = true so that you still get the
benefits.

As you can see from the above paper, stopwords can be extremely important
depending on the language; they just don't matter so much for English.
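
How discountOverlaps gets switched on depends on the Solr version. As a sketch only: later Solr releases accept a similarity factory in schema.xml with a discountOverlaps option, while on a release without such a factory it would have to be set on DefaultSimilarity in custom code. Treat the factory class name below as an assumption to check against the version in use.

```xml
<similarity class="solr.DefaultSimilarityFactory">
  <!-- tokens with position increment 0 (such as the extra bigrams) do not count towards field length -->
  <bool name="discountOverlaps">true</bool>
</similarity>
```

The point of the setting is that the additional bigram tokens should not make a field look longer than it is, which would otherwise lower its length-normalized score.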

On Tue, Jan 12, 2010 at 9:20 PM, Lance Norskog <go...@gmail.com> wrote:
> There are a lot of projects that don't use stopwords any more. You
> might consider dropping them altogether.



-- 
Robert Muir
rcmuir@gmail.com

Re: Multi language support

Posted by Lance Norskog <go...@gmail.com>.
There are a lot of projects that don't use stopwords any more. You
might consider dropping them altogether.

On Mon, Jan 11, 2010 at 2:25 PM, Don Werve <do...@madwombat.com> wrote:
> This is the way I've implemented multilingual search as well.



-- 
Lance Norskog
goksron@gmail.com

Re: Multi language support

Posted by Don Werve <do...@madwombat.com>.
This is the way I've implemented multilingual search as well.

2010/1/11 Markus Jelsma <ma...@buyways.nl>

> Hello,
>
>
> We have implemented language-specific search in Solr using
> language-specific fields and field types. For instance, an en_text field
> type can use an English stemmer and a list of stopwords and synonyms. We
> did not, however, use language-specific stopwords; instead we used one
> list shared by both languages.
>
> So you would have a field type like:
> <fieldType name="en_text" class="solr.TextField" ...>
>  <analyzer>
>   <!-- a tokenizer goes first; further filters such as a stemmer can follow -->
>   <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
>   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>
>  </analyzer>
> </fieldType>
>
> etc etc.
>
>
>
> Cheers,
>
> -
> Markus Jelsma          Buyways B.V.
> Technisch Architect    Friesestraatweg 215c
> http://www.buyways.nl  9743 AD Groningen
>
>
> Alg. 050-853 6600      KvK  01074105
> Tel. 050-853 6620      Fax. 050-3118124
> Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17
>

Re: Multi language support

Posted by Markus Jelsma <ma...@buyways.nl>.
Hello,


We have implemented language-specific search in Solr using
language-specific fields and field types. For instance, an en_text field
type can use an English stemmer and a list of stopwords and synonyms. We
did not, however, use language-specific stopwords; instead we used one
list shared by both languages.

So you would have a field type like:
<fieldType name="en_text" class="solr.TextField" ...>
 <analyzer>
  <!-- a tokenizer goes first; further filters such as a stemmer can follow -->
  <filter class="solr.StopFilterFactory" words="stopwords.en.txt"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.en.txt"/>
 </analyzer>
</fieldType>

etc etc.
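
To make the per-language idea concrete, here is a sketch of how a second language could sit next to en_text. German is picked purely for illustration (the thread does not say which two languages were used), and the field names, file names and the choice of tokenizer and stemmer are likewise assumptions.

```xml
<fieldType name="de_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.de.txt" ignoreCase="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.de.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<!-- one field per language; each document fills only the field for its own language -->
<field name="body_en" type="en_text" indexed="true" stored="true"/>
<field name="body_de" type="de_text" indexed="true" stored="true"/>
```

The indexing client then writes the text into body_en or body_de depending on the document's language, and queries go against the field that matches the language of the user.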



Cheers,

-  
Markus Jelsma          Buyways B.V.            
Technisch Architect    Friesestraatweg 215c    
http://www.buyways.nl  9743 AD Groningen       


Alg. 050-853 6600      KvK  01074105
Tel. 050-853 6620      Fax. 050-3118124
Mob. 06-5025 8350      In: http://www.linkedin.com/in/markus17


On Mon, 2010-01-11 at 13:45 +0100, Daniel Persson wrote:

> Hi Solr users.
> 
> I'm trying to set up a site with Solr search integrated, and I use the
> SolrJ API to feed the index with search documents. At the moment I
> have only activated search on the English portion of the site. I'm
> interested in using as many features of Solr as possible. Synonyms,
> stopwords and stemming all sound quite interesting and useful, but how do
> I set this up in a good way for a multilingual site?
> 
> The site doesn't have a huge mass of text, so performance issues don't
> really bother me, but I'd still like to hear your suggestions before I
> try to implement a solution.
> 
> Best regards
> 
> Daniel