Posted to solr-user@lucene.apache.org by Paul Borgermans <pa...@gmail.com> on 2008/12/17 17:25:19 UTC

Re: umlaut index ö == o == oe Possible?

The ISOLatin1AccentFilter does most of this; œ and ö are both converted to o.
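
In schema.xml that could look roughly like this (an untested sketch; the field type name is invented, and it assumes a Solr release that ships both factories). Putting the accent filter before the German stemmer suggested below covers the umlaut and stemming cases at index and query time alike:

  <fieldType name="text_de" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    </analyzer>
  </fieldType>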

hth
Paul

On Wed, Dec 17, 2008 at 1:27 AM, Stephen Weiss <sw...@stylesight.com> wrote:

> I believe the German porter stemmer should handle this.  I haven't used it
> with Solr but I've used it with other projects, and basically, when the word
> is parsed, the umlauts and also accented vowels are converted to plain
> vowels.  I guess with Solr you use solr.SnowballPorterFilterFactory:
>
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-b80fb581f4e078142c694014f1a8f60c0935e080
>
> with the German option (like in their example).
>
> You probably want to apply this both at index and query time.
>
> --
> Steve
>
>
> On Dec 16, 2008, at 6:02 PM, Julian Davchev wrote:
>
>  Hi,
>> I am just going through
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters and the mailing
>> list archive,
>> but somehow can't find the solution. Is it possible to treat
>> 'möchten', 'mochten' and 'moechten' the same way?
>> Of course not hardcoding this, but rather having it work for any umlaut.
>> Cheers
>>
>>
>>
>

Re: looking for multilanguage indexing best practice/hint

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: looking for multilanguage indexing best practice/hint
: References: <49...@drun.net>	
:     <50...@stylesight.com>
:     <8c...@mail.gmail.com>
: In-Reply-To: <8c...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


Re: looking for multilanguage indexing best practice/hint

Posted by Julian Davchev <jm...@drun.net>.
Dude,
There was already a warning about stealing the thread. Please do something
about it as advised.  Start your own thread if you want answers to your problem.
Cheers,


Sujatha Arun wrote:
> Thanks Daniel and Erik,
>
> The requirement from the user end is to only search in that particular
> language and not across languages.
>
> Also going forward we will be adding more languages.
>
> So if I have separate fields for each language, then we need to change the
> schema every time, and that will not scale very well.
>
> So there are two options: either use dynamic fields or use multi-core.
>
> Please advise which is better in terms of scaling and optimum use of existing
> resources (the available RAM is about 4GB for several instances of Solr).
>
> If we use multi-core, will it degrade in terms of speed etc.?
>
> Any pointers will be helpful
>
> Regards
> Sujatha
>
>
>
>
> On 12/19/08, Julian Davchev <jm...@drun.net> wrote:
>   
>> Thanks Erick,
>> I think I will go with different language fields, as I want to use
>> different stop words, analyzers etc.
>> I might also consider a schema per language so scaling is more flexible, as
>> I was already advised, but this will really only make sense if I have more
>> than one server, I guess; otherwise all the other data is just duplicated
>> for no reason.
>> We have already decided that the language will be passed each time in
>> search, so it won't make sense to search the query in every language.
>>
>> As for CJKAnalyzer, from a first look it doesn't seem to be in Solr (haven't
>> tried it yet), and since I am a noob in Java I will check how it's done.
>> Will definitely give it a try.
>>
>> Thanks a lot for the help.
>>
>> Erick Erickson wrote:
>>     
>>> See the CJKAnalyzer for a start, StandardAnalyzer won't
>>> help you much.
>>>
>>> Also, tell us a little more about your requirements. For instance,
>>> if a user submits a query in Japanese, do you want to search
>>> across documents in the other languages too? And will you want
>>> to associate different analyzers with the content from different
>>> languages? You really have two options:
>>>
>>> If you want different analyzers used with the different languages,
>>> you probably have to index the content in different fields. That is,
>>> a Chinese document would have a chinese_content field, a Japanese
>>> document would have a japanese_content field etc. Now you can
>>> associate a different analyzer with each *_content field.
>>>
>>> If the same analyzer would work for all three languages, you
>>> can just index all the content in a "content" field, and if you
>>> need to restrict searching to the language in which the query
>>> was submitted, you could always add a clause on the
>>> language, e.g. AND language:chinese
>>>
>>> Hope this helps
>>> Erick
>>>
>>> On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <su...@gmail.com> wrote:
>>>       
>>>> Hi,
>>>>
>>>> I am prototyping language search using Solr 1.3. I have 3 fields in the
>>>> schema: id, content and language.
>>>>
>>>> I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.
>>>>
>>>> I use xpdf to convert the content of the pdf to text and push the text to
>>>> Solr in the content field.
>>>>
>>>> What is the analyzer that I need to use for the above?
>>>>
>>>> By using the default text analyzer and posting this content to Solr, I am
>>>> not getting any results.
>>>>
>>>> Does Solr support stemming for the above languages?
>>>>
>>>> Regards
>>>> Sujatha
>>>>
>>>>
>>>>
>>>>
>>>> On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
>>>>
>>>>         
>>>>> Don't forget to consider scaling concerns (if there are any). There are
>>>>> strong differences in the number of searches we receive for each
>>>>> language. We chose to create separate schema and config per language so
>>>>> that we can throw servers at a particular language (or set of languages)
>>>>> if we needed to. We see 2 orders of magnitude difference between our
>>>>> most popular language and our least popular.
>>>>>
>>>>> -Todd Feak
>>>>>
>>>>> -----Original Message-----
>>>>> From: Julian Davchev [mailto:jmut@drun.net]
>>>>> Sent: Wednesday, December 17, 2008 11:31 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: looking for multilanguage indexing best practice/hint
>>>>>
>>>>> Hi,
>>>>> From my study of Solr and Lucene so far, it seems that I will use a single
>>>>> schema... at least I don't see a scenario where I'd need more than that.
>>>>> So the question is how do I approach multilanguage indexing and multilanguage
>>>>> searching. Will it really make sense to just search a word, or should I
>>>>> rather supply a lang param to the search as well?
>>>>>
>>>>> I see there are those filters and was already advised on them, but I guess
>>>>> the question is more one of best practice:
>>>>> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>>>>>
>>>>> So the solution I see is using copyField so I have the same field in
>>>>> different langs, or something using a distinct filter.
>>>>> Cheers
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>           
>>>       
>>     
>
>   


Re: looking for multilanguage indexing best practice/hint

Posted by Sujatha Arun <su...@gmail.com>.
Thanks Daniel and Erik,

The requirement from the user end is to only search in that particular
language and not across languages.

Also going forward we will be adding more languages.

So if I have separate fields for each language, then we need to change the
schema every time, and that will not scale very well.

So there are two options: either use dynamic fields or use multi-core.

Please advise which is better in terms of scaling and optimum use of existing
resources (the available RAM is about 4GB for several instances of Solr).

If we use multi-core, will it degrade in terms of speed etc.?

Any pointers will be helpful

Regards
Sujatha
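
P.S. For concreteness, the dynamic-field option would be a one-time schema.xml declaration per analyzer family, after which a new language needs no schema change (a sketch; the suffixes and type names are invented):

  <dynamicField name="*_en" type="text_en" indexed="true" stored="true"/>
  <dynamicField name="*_zh" type="text_cjk" indexed="true" stored="true"/>

A document then carries e.g. a content_zh field. Note that each pattern is still tied to one analyzer, so languages needing different analysis still need their own pattern.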




On 12/19/08, Julian Davchev <jm...@drun.net> wrote:
>
> Thanks Erick,
> I think I will go with different language fields, as I want to use
> different stop words, analyzers etc.
> I might also consider a schema per language so scaling is more flexible, as
> I was already advised, but this will really only make sense if I have more
> than one server, I guess; otherwise all the other data is just duplicated
> for no reason.
> We have already decided that the language will be passed each time in
> search, so it won't make sense to search the query in every language.
>
> As for CJKAnalyzer, from a first look it doesn't seem to be in Solr (haven't
> tried it yet), and since I am a noob in Java I will check how it's done.
> Will definitely give it a try.
>
> Thanks a lot for the help.
>
> Erick Erickson wrote:
> > See the CJKAnalyzer for a start, StandardAnalyzer won't
> > help you much.
> >
> > Also, tell us a little more about your requirements. For instance,
> > if a user submits a query in Japanese, do you want to search
> > across documents in the other languages too? And will you want
> > to associate different analyzers with the content from different
> > languages? You really have two options:
> >
> > If you want different analyzers used with the different languages,
> > you probably have to index the content in different fields. That is,
> > a Chinese document would have a chinese_content field, a Japanese
> > document would have a japanese_content field etc. Now you can
> > associate a different analyzer with each *_content field.
> >
> > If the same analyzer would work for all three languages, you
> > can just index all the content in a "content" field, and if you
> > need to restrict searching to the language in which the query
> > was submitted, you could always add a clause on the
> > language, e.g. AND language:chinese
> >
> > Hope this helps
> > Erick
> >
> > On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <su...@gmail.com> wrote:
> >
> >
> >> Hi,
> >>
> >> I am prototyping language search using Solr 1.3. I have 3 fields in the
> >> schema: id, content and language.
> >>
> >> I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.
> >>
> >> I use xpdf to convert the content of the pdf to text and push the text to
> >> Solr in the content field.
> >>
> >> What is the analyzer that I need to use for the above?
> >>
> >> By using the default text analyzer and posting this content to Solr, I am
> >> not getting any results.
> >>
> >> Does Solr support stemming for the above languages?
> >>
> >> Regards
> >> Sujatha
> >>
> >>
> >>
> >>
> >> On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
> >>
> >>> Don't forget to consider scaling concerns (if there are any). There are
> >>> strong differences in the number of searches we receive for each
> >>> language. We chose to create separate schema and config per language so
> >>> that we can throw servers at a particular language (or set of languages)
> >>> if we needed to. We see 2 orders of magnitude difference between our
> >>> most popular language and our least popular.
> >>>
> >>> -Todd Feak
> >>>
> >>> -----Original Message-----
> >>> From: Julian Davchev [mailto:jmut@drun.net]
> >>> Sent: Wednesday, December 17, 2008 11:31 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: looking for multilanguage indexing best practice/hint
> >>>
> >>> Hi,
> >>> From my study of Solr and Lucene so far, it seems that I will use a single
> >>> schema... at least I don't see a scenario where I'd need more than that.
> >>> So the question is how do I approach multilanguage indexing and multilanguage
> >>> searching. Will it really make sense to just search a word, or should I
> >>> rather supply a lang param to the search as well?
> >>>
> >>> I see there are those filters and was already advised on them, but I guess
> >>> the question is more one of best practice:
> >>> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
> >>>
> >>> So the solution I see is using copyField so I have the same field in
> >>> different langs, or something using a distinct filter.
> >>> Cheers
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> >
>
>

Re: looking for multilanguage indexing best practice/hint

Posted by Julian Davchev <jm...@drun.net>.
Thanks Erick,
I think I will go with different language fields, as I want to use
different stop words, analyzers etc.
I might also consider a schema per language so scaling is more flexible, as
I was already advised, but this will really only make sense if I have more
than one server, I guess; otherwise all the other data is just duplicated
for no reason.
We have already decided that the language will be passed each time in
search, so it won't make sense to search the query in every language.

As for CJKAnalyzer, from a first look it doesn't seem to be in Solr (haven't
tried it yet), and since I am a noob in Java I will check how it's done.
Will definitely give it a try.

Thanks a lot for the help.

Erick Erickson wrote:
> See the CJKAnalyzer for a start, StandardAnalyzer won't
> help you much.
>
> Also, tell us a little more about your requirements. For instance,
> if a user submits a query in Japanese, do you want to search
> across documents in the other languages too? And will you want
> to associate different analyzers with the content from different
> languages? You really have two options:
>
> If you want different analyzers used with the different languages,
> you probably have to index the content in different fields. That is,
> a Chinese document would have a chinese_content field, a Japanese
> document would have a japanese_content field etc. Now you can
> associate a different analyzer with each *_content field.
>
> If the same analyzer would work for all three languages, you
> can just index all the content in a "content" field, and if you
> need to restrict searching to the language in which the query
> was submitted, you could always add a clause on the
> language, e.g. AND language:chinese
>
> Hope this helps
> Erick
>
> On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <su...@gmail.com> wrote:
>
>   
>> Hi,
>>
>> I am prototyping language search using Solr 1.3. I have 3 fields in the
>> schema: id, content and language.
>>
>> I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.
>>
>> I use xpdf to convert the content of the pdf to text and push the text to
>> Solr in the content field.
>>
>> What is the analyzer that I need to use for the above?
>>
>> By using the default text analyzer and posting this content to Solr, I am
>> not getting any results.
>>
>> Does Solr support stemming for the above languages?
>>
>> Regards
>> Sujatha
>>
>>
>>
>>
>> On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
>>     
>>> Don't forget to consider scaling concerns (if there are any). There are
>>> strong differences in the number of searches we receive for each
>>> language. We chose to create separate schema and config per language so
>>> that we can throw servers at a particular language (or set of languages)
>>> if we needed to. We see 2 orders of magnitude difference between our
>>> most popular language and our least popular.
>>>
>>> -Todd Feak
>>>
>>> -----Original Message-----
>>> From: Julian Davchev [mailto:jmut@drun.net]
>>> Sent: Wednesday, December 17, 2008 11:31 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: looking for multilanguage indexing best practice/hint
>>>
>>> Hi,
>>> From my study on solr and lucene so far it seems that I will use single
>>> scheme.....at least don't see scenario where I'd need more than that.
>>> So question is how do I approach multilanguage indexing and multilang
>>> searching. Will it really make sense for just searching word..or rather
>>> I should supply lang param to search as well.
>>>
>>> I see there are those filters and already advised on them but I guess
>>> question is more of a best practice.
>>> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>>>
>>> So solution I see is using copyField I have same field in different
>>> langs or something using distinct filter.
>>> Cheers
>>>
>>>
>>>
>>>
>>>       
>
>   


Re: Solr and Autocompletion

Posted by Ryan McKinley <ry...@gmail.com>.
lots of options out there....

Rather than doing a slow query like PrefixQuery, I think it's best to index
the n-grams so the autocomplete is a fast query.

http://www.mail-archive.com/solr-user@lucene.apache.org/msg06776.html
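
A field type along these lines indexes edge n-grams at index time only, so each prefix becomes a real term and autocomplete is a cheap term lookup (a rough sketch; it assumes a Solr build that includes solr.EdgeNGramFilterFactory, which may need to be pulled in on a 1.3-era install):

  <fieldType name="text_prefix" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

A query like q=suggest:goo (where suggest is a field of this type; the name is made up) then matches "google" without any prefix expansion.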



On Dec 18, 2008, at 11:56 AM, Kashyap, Raghu wrote:

> Hi,
>
>  One of the things we are looking for is to autofill the keywords when
> people start typing (e.g. Google autofill).
>
> Currently we are using the RangeQuery. I read about the PrefixQuery  
> and feel that it might be appropriate for this kind of implementation.
>
> Has anyone implemented the autofill feature? If so what do you  
> recommend?
>
> Thanks,
> Raghu


Re: Solr and Autocompletion

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: Solr and Autocompletion
: References: <49...@drun.net>
:      <50...@stylesight.com>
:      <8c...@mail.gmail.com>
:      <49...@drun.net>
:      <85...@mail-sd1.ad.soe.sony.com>
:      <41...@mail.gmail.com>
:  <35...@mail.gmail.com>
: In-Reply-To: <35...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss


Solr and Autocompletion

Posted by "Kashyap, Raghu" <Ra...@orbitz.com>.
Hi,

  One of the things we are looking for is to autofill the keywords when people start typing (e.g. Google autofill).

Currently we are using the RangeQuery. I read about the PrefixQuery and feel that it might be appropriate for this kind of implementation.

Has anyone implemented the autofill feature? If so what do you recommend?

Thanks,
Raghu

Re: looking for multilanguage indexing best practice/hint

Posted by Erick Erickson <er...@gmail.com>.
See the CJKAnalyzer for a start, StandardAnalyzer won't
help you much.

Also, tell us a little more about your requirements. For instance,
if a user submits a query in Japanese, do you want to search
across documents in the other languages too? And will you want
to associate different analyzers with the content from different
languages? You really have two options:

If you want different analyzers used with the different languages,
you probably have to index the content in different fields. That is,
a Chinese document would have a chinese_content field, a Japanese
document would have a japanese_content field etc. Now you can
associate a different analyzer with each *_content field.

If the same analyzer would work for all three languages, you
can just index all the content in a "content" field, and if you
need to restrict searching to the language in which the query
was submitted, you could always add a clause on the
language, e.g. AND language:chinese
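
In schema.xml terms, the first option looks roughly like this (a sketch; the names are invented, and it assumes the Lucene contrib jar containing CJKAnalyzer is on Solr's classpath):

  <fieldType name="text_cjk" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  </fieldType>

  <field name="chinese_content"  type="text_cjk" indexed="true" stored="true"/>
  <field name="japanese_content" type="text_cjk" indexed="true" stored="true"/>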

Hope this helps
Erick

On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun <su...@gmail.com> wrote:

> Hi,
>
> I am prototyping language search using Solr 1.3. I have 3 fields in the
> schema: id, content and language.
>
> I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.
>
> I use xpdf to convert the content of the pdf to text and push the text to
> Solr in the content field.
>
> What is the analyzer that I need to use for the above?
>
> By using the default text analyzer and posting this content to Solr, I am
> not getting any results.
>
> Does Solr support stemming for the above languages?
>
> Regards
> Sujatha
>
>
>
>
> On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
> >
> > Don't forget to consider scaling concerns (if there are any). There are
> > strong differences in the number of searches we receive for each
> > language. We chose to create separate schema and config per language so
> > that we can throw servers at a particular language (or set of languages)
> > if we needed to. We see 2 orders of magnitude difference between our
> > most popular language and our least popular.
> >
> > -Todd Feak
> >
> > -----Original Message-----
> > From: Julian Davchev [mailto:jmut@drun.net]
> > Sent: Wednesday, December 17, 2008 11:31 AM
> > To: solr-user@lucene.apache.org
> > Subject: looking for multilanguage indexing best practice/hint
> >
> > Hi,
> > From my study of Solr and Lucene so far, it seems that I will use a single
> > schema... at least I don't see a scenario where I'd need more than that.
> > So the question is how do I approach multilanguage indexing and multilanguage
> > searching. Will it really make sense to just search a word, or should I
> > rather supply a lang param to the search as well?
> >
> > I see there are those filters and was already advised on them, but I guess
> > the question is more one of best practice:
> > solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
> >
> > So the solution I see is using copyField so I have the same field in
> > different langs, or something using a distinct filter.
> > Cheers
> >
> >
> >
> >
>

RE: looking for multilanguage indexing best practice/hint

Posted by Daniel Alheiros <Da...@bbc.co.uk>.
Hi Sujatha.

I've developed a search system for 6 different languages, and as it was
implemented on Solr 1.2, all those languages are part of the same index,
using different fields for each so I can have different analyzers for
each one.

Like:
content_chinese
content_english
content_russian
content_arabic

I've also defined a language field that I use to be able to separate
those at query time.

As you are going to implement it using Solr 1.3, I would rather create
one core per language and keep the schema simpler, without the _language
suffix. Each schema (one per language) would have only, say, a content
field, which depending on its language will use the proper analyzer and filters.

Having a separate core per language is also good as the scores for a
language won't be affected by the indexing of documents in other
languages.
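
In Solr 1.3 that per-language layout is declared in solr.xml, roughly like this (a sketch; the core names and directories are invented):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="english" instanceDir="english"/>
      <core name="chinese" instanceDir="chinese"/>
      <core name="russian" instanceDir="russian"/>
    </cores>
  </solr>

Each instanceDir has its own conf/schema.xml, so the content field can use a different analyzer per core.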

Do you have any requirement for searching across all languages, say q=test
where this term should be found in any language? If so, you may think of
distributed search to combine your results, or even take the same
approach I've taken, as I couldn't use multi-core.

I'm also using the DisMax request handler, which is worth a look:
you can pre-define some base query parts and also do score boosting
behind the scenes.
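
A minimal DisMax handler in solrconfig.xml could look like this (a sketch; the qf fields and boosts are placeholders):

  <requestHandler name="dismax" class="solr.DisMaxRequestHandler">
    <lst name="defaults">
      <str name="qf">content^1.0 title^2.0</str>
      <str name="mm">1</str>
    </lst>
  </requestHandler>

Queries sent with qt=dismax then pick up those defaults.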

I hope it helps.

Regards,
Daniel 

-----Original Message-----
From: Sujatha Arun [mailto:suja.arun@gmail.com] 
Sent: 18 December 2008 04:15
To: solr-user@lucene.apache.org
Subject: Re: looking for multilanguage indexing best practice/hint

Hi,

I am prototyping language search using Solr 1.3. I have 3 fields in the
schema: id, content and language.

I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.

I use xpdf to convert the content of the pdf to text and push the text to
Solr in the content field.

What is the analyzer that I need to use for the above?

By using the default text analyzer and posting this content to Solr, I am
not getting any results.

Does Solr support stemming for the above languages?

Regards
Sujatha




On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
>
> Don't forget to consider scaling concerns (if there are any). There 
> are strong differences in the number of searches we receive for each 
> language. We chose to create separate schema and config per language 
> so that we can throw servers at a particular language (or set of 
> languages) if we needed to. We see 2 orders of magnitude difference 
> between our most popular language and our least popular.
>
> -Todd Feak
>
> -----Original Message-----
> From: Julian Davchev [mailto:jmut@drun.net]
> Sent: Wednesday, December 17, 2008 11:31 AM
> To: solr-user@lucene.apache.org
> Subject: looking for multilanguage indexing best practice/hint
>
> Hi,
> From my study of Solr and Lucene so far, it seems that I will use a single
> schema... at least I don't see a scenario where I'd need more than that.
> So the question is how do I approach multilanguage indexing and multilanguage
> searching. Will it really make sense to just search a word, or should I
> rather supply a lang param to the search as well?
>
> I see there are those filters and was already advised on them, but I guess
> the question is more one of best practice:
> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>
> So the solution I see is using copyField so I have the same field in
> different langs, or something using a distinct filter.
> Cheers
>
>
>
>


Re: looking for multilanguage indexing best practice/hint

Posted by Sujatha Arun <su...@gmail.com>.
Hi,

I am prototyping language search using Solr 1.3. I have 3 fields in the
schema: id, content and language.

I am indexing 3 pdf files; the languages are Foroyo, Chinese and Japanese.

I use xpdf to convert the content of the pdf to text and push the text to
Solr in the content field.

What is the analyzer that I need to use for the above?

By using the default text analyzer and posting this content to Solr, I am
not getting any results.

Does Solr support stemming for the above languages?

Regards
Sujatha




On 12/18/08, Feak, Todd <To...@smss.sony.com> wrote:
>
> Don't forget to consider scaling concerns (if there are any). There are
> strong differences in the number of searches we receive for each
> language. We chose to create separate schema and config per language so
> that we can throw servers at a particular language (or set of languages)
> if we needed to. We see 2 orders of magnitude difference between our
> most popular language and our least popular.
>
> -Todd Feak
>
> -----Original Message-----
> From: Julian Davchev [mailto:jmut@drun.net]
> Sent: Wednesday, December 17, 2008 11:31 AM
> To: solr-user@lucene.apache.org
> Subject: looking for multilanguage indexing best practice/hint
>
> Hi,
> From my study of Solr and Lucene so far, it seems that I will use a single
> schema... at least I don't see a scenario where I'd need more than that.
> So the question is how do I approach multilanguage indexing and multilanguage
> searching. Will it really make sense to just search a word, or should I
> rather supply a lang param to the search as well?
>
> I see there are those filters and was already advised on them, but I guess
> the question is more one of best practice:
> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>
> So the solution I see is using copyField so I have the same field in
> different langs, or something using a distinct filter.
> Cheers
>
>
>
>

RE: looking for multilanguage indexing best practice/hint

Posted by "Feak, Todd" <To...@smss.sony.com>.
Don't forget to consider scaling concerns (if there are any). There are
strong differences in the number of searches we receive for each
language. We chose to create separate schema and config per language so
that we can throw servers at a particular language (or set of languages)
if we needed to. We see 2 orders of magnitude difference between our
most popular language and our least popular.

-Todd Feak

-----Original Message-----
From: Julian Davchev [mailto:jmut@drun.net] 
Sent: Wednesday, December 17, 2008 11:31 AM
To: solr-user@lucene.apache.org
Subject: looking for multilanguage indexing best practice/hint

Hi,
From my study of Solr and Lucene so far, it seems that I will use a single
schema... at least I don't see a scenario where I'd need more than that.
So the question is how do I approach multilanguage indexing and multilanguage
searching. Will it really make sense to just search a word, or should I
rather supply a lang param to the search as well?

I see there are those filters and was already advised on them, but I guess
the question is more one of best practice:
solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory

So the solution I see is using copyField so I have the same field in
different langs, or something using a distinct filter.
Cheers




Re: looking for multilanguage indexing best practice/hint

Posted by Alexander Ramos Jardim <al...@gmail.com>.
I think this depends on your needs.

If you will run one search across many languages, and your docs won't get too
big, you can put all the data in one schema.xml and configure your field
types on a per-language basis.
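
For instance, something like this (a sketch; the names are invented, and text_zh would be defined analogously with a CJK-capable analyzer):

  <fieldType name="text_en" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    </analyzer>
  </fieldType>

  <field name="content_en" type="text_en" indexed="true" stored="true"/>
  <field name="content_zh" type="text_zh" indexed="true" stored="true"/>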


2008/12/17 Julian Davchev <jm...@drun.net>

> Hi,
> From my study of Solr and Lucene so far, it seems that I will use a single
> schema... at least I don't see a scenario where I'd need more than that.
> So the question is how do I approach multilanguage indexing and multilanguage
> searching. Will it really make sense to just search a word, or should I
> rather supply a lang param to the search as well?
>
> I see there are those filters and was already advised on them, but I guess
> the question is more one of best practice:
> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
>
> So the solution I see is using copyField so I have the same field in
> different langs, or something using a distinct filter.
> Cheers
>
>
>


-- 
Alexander Ramos Jardim

looking for multilanguage indexing best practice/hint

Posted by Julian Davchev <jm...@drun.net>.
Hi,
From my study of Solr and Lucene so far, it seems that I will use a single
schema... at least I don't see a scenario where I'd need more than that.
So the question is how do I approach multilanguage indexing and multilanguage
searching. Will it really make sense to just search a word, or should I
rather supply a lang param to the search as well?

I see there are those filters and was already advised on them, but I guess
the question is more one of best practice:
solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory

So the solution I see is using copyField so I have the same field in
different langs, or something using a distinct filter.
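In schema.xml the copyField idea would look something like this (a sketch; the field and type names are invented):

  <field name="content"    type="text"    indexed="true" stored="true"/>
  <field name="content_de" type="text_de" indexed="true" stored="false"/>
  <copyField source="content" dest="content_de"/>

so the same raw text gets analyzed once per language-specific field.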
Cheers