Posted to solr-user@lucene.apache.org by Julian Davchev <jm...@drun.net> on 2009/01/28 23:21:30 UTC

multilanguage + howto search in all languages?

Hi,
I currently have two indexes with Solr: one for the English version
and one for the German version. They use the English and German2
snowball factories respectively.
Right now, depending on which language the website is in, I query the
corresponding index.
There is a requirement, though, that content be found regardless of
which language it is in.
So, for example, a search for muenchen (which the German snowball
factory correctly treats as münchen) should also find matches in the
English index. Right now it does not, as I suppose the English
factory doesn't really care about umlauts.

Any pointers are more than welcome. I am considering synonyms, but
that would be rather heavy to create and maintain.
Cheers,
JD
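
For reference, the setup described above corresponds roughly to field
types like the following in schema.xml. This is only a sketch: the type
names and the tokenizer choice are assumptions, not taken from the
actual schema.

    <fieldType name="text_en" class="solr.TextField">
      <analyzer>
        <!-- tokenize, lowercase, then apply the English snowball stemmer -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_de" class="solr.TextField">
      <analyzer>
        <!-- same chain with the German2 stemmer, which also recognizes
             ae/oe/ue as spellings of ä/ö/ü, so muenchen and münchen
             stem to the same token -->
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
      </analyzer>
    </fieldType>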

Re: multilanguage + howto search in all languages?

Posted by Julian Davchev <jm...@drun.net>.
Thank you both for the pointers. For now I am handling it with fuzzy
search. Let's hope this will do for some time :)
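
For reference, the fuzzy form of the example query would look something
like this (the field name "text" and the 0.7 threshold are illustrative,
not from the actual setup):

    q=text:muenchen~0.7

The trailing ~ invokes Lucene's fuzzy matching; the optional number
between 0 and 1 is the minimum similarity, so muenchen can match
near-variants such as munchen without any language-specific analysis.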


Re: multilanguage + howto search in all languages?

Posted by Walter Underwood <wu...@netflix.com>.
Duh. Four cases. For extra credit, what language is "wunder" in?

wunder


Re: multilanguage + howto search in all languages?

Posted by Walter Underwood <wu...@netflix.com>.
I've done this. There are five cases for the tokens in the search
index:

1. Tokens that are unique after stemming (this is good).
2. Tokens that are common after stemming (usually trademarks,
   like LaserJet).
3. Tokens with collisions after stemming:
   German "mit", "MIT" the university
   German "Boot" (boat), English "boot" (a heavy shoe)
4. Tokens with collisions in the surface form:
   Dutch "mobile" (plural of furniture), English "mobile"
   German "die" (stemmed to "das"), English "die"

You cannot fix every spurious match, but you can do OK with
stemmed fields for each language and a raw (unstemmed surface
token) field.

I won't recommend weights, but you could have fields for
text_en, text_de, and text_raw, for example.
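
A sketch of how those might be declared in schema.xml. Only the three
field names come from the suggestion above; the source field "body" and
the text_raw analysis chain are assumptions:

    <field name="text_en"  type="text_en"  indexed="true" stored="false"/>
    <field name="text_de"  type="text_de"  indexed="true" stored="false"/>
    <field name="text_raw" type="text_raw" indexed="true" stored="false"/>

    <!-- index the same source text three ways -->
    <copyField source="body" dest="text_en"/>
    <copyField source="body" dest="text_de"/>
    <copyField source="body" dest="text_raw"/>

    <!-- text_raw: surface tokens only, no stemming -->
    <fieldType name="text_raw" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

A query can then search all three fields, with the raw field anchoring
exact surface matches.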

You really cannot automatically determine the language of a
query, mostly because of proper nouns, especially trademarks.
Identify the language of these queries:

* Google
* LaserJet
* Obama
* Las Vegas
* Paris

HTTP supports an Accept-Language header, but I have no idea
how often that is sent. We honored that in Ultraseek, mostly
because it was standard.
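
A typical value looks like this; the q weights express the client's
order of preference:

    Accept-Language: de-DE,de;q=0.8,en;q=0.5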

Finally, if you are working with localization, please take the
time to understand the difference between ISO language codes
and ISO country codes.
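
The distinction in one example:

    de     ISO 639 language code: German
    DE     ISO 3166 country code: Germany
    de-CH  German as used in Switzerland
    sv-FI  Swedish as used in Finland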

wunder


Re: multilanguage + howto search in all languages?

Posted by Erick Erickson <er...@gmail.com>.
I'm not entirely sure about the fine points, but consider the
filters that are available that fold all the diacritics into their
low-ASCII equivalents. Perhaps using that filter at *both* index
and search time on the English index would do the trick.

In your example, both would be 'munchen'. Straight English
would be unaffected by the filter, but any German words with
diacritics that crept in would be folded into their low-ASCII
"equivalents". This would also work at index time, just in case
you indexed English text that had some German words.
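
Something like the following field type might do it. A sketch, assuming
Solr's ISOLatin1AccentFilterFactory, the diacritic-folding filter
available at the time (later releases provide ASCIIFoldingFilterFactory
for the same job):

    <fieldType name="text_en" class="solr.TextField">
      <!-- the same analyzer is applied at index and query time -->
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- fold diacritics to low-ASCII before further processing -->
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
      </analyzer>
    </fieldType>

With this chain, münchen is indexed and searched as munchen.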

NOTE: My experience is more on the Lucene side than the Solr
side, but I'm sure the filters are available.

Best
Erick
