Posted to solr-user@lucene.apache.org by "Vinay B," <vy...@gmail.com> on 2013/03/01 21:49:03 UTC

Language Identification and Stemming

As I understand, SOLR allows us to plug in language detection
processors: http://wiki.apache.org/solr/LanguageDetection

Given that our use case involves a collection of mixed-language documents:

Q1. Assuming we plug in language detection, will this affect stemming
and other language-specific operations? E.g., will the stemmers use the
language identified by the language detection code? See:
http://www.early-dance.de/news/9188-optimizing-apachesolr-non-english-languages

Q2. Currently, we don't explicitly use a processor chain for our
updates, just a custom update handler that also returns custom
opcodes etc. in the response. If we plug in language detection via an
update chain connected to this request handler, (how) can we pass the
chosen language back via the response?

    <requestHandler name="/update/myupdatet"
                  class="com.xyz.MyDocUpdateHandler" />

Thanks

Re: Language Identification and Stemming

Posted by Jan Høydahl <ja...@cominvent.com>.

In addition to the text_lang fields you can of course have a text_general
field which is unstemmed, where you put documents that you don't yet have
language specific handling for.
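
As a rough sketch of such an unstemmed catch-all, a schema.xml fragment could
look like the following. The analyzer chain (standard tokenizer plus
lowercasing, no stemmer) is an assumption about what "unstemmed" should mean
here, and the field attributes are illustrative only.

```xml
<!-- schema.xml fragment (sketch): catch-all type with no stemming filter.
     The fieldType goes in the types section, the field in the fields section. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Documents without language-specific handling are indexed here -->
<field name="text_general" type="text_general" indexed="true" stored="true"/>
```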

One potential issue with multi-language search is detecting the language of the query itself.
Sometimes your search page knows in advance what language will be input, and then you can
target the search at text_<lang> only. Other times you won't know what language
it is, and then you have a few choices:

a) Try to detect the language
b) Search across all languages (text_en OR text_fr OR ...)
c) Skip stemming and use only text_general

Detecting the language of a short one- or two-word query is hard. You will be able
to distinguish Chinese from Japanese from Western languages based on unique characters,
but it is much harder to distinguish Western languages from each other.

Searching across all languages works great, but you may get some false positives,
e.g. from stemming, when a word occurs with different meanings in several languages.
Besides, if you have 200 languages in your index, it is impractical to search across
200 fields.
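
One way to search across a handful of language fields without writing the OR
by hand is to let the edismax query parser spread the query over them. This is
only a sketch: the handler name /select_multilang is hypothetical, and the
field list assumes text_en, text_fr and text_general exist in the schema.

```xml
<!-- solrconfig.xml fragment (sketch): query all language fields at once.
     Handler name and field list are assumptions for illustration. -->
<requestHandler name="/select_multilang" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Each field analyzes the query with its own language's chain -->
    <str name="qf">text_en text_fr text_general</str>
  </lst>
</requestHandler>
```

When the page does know the query language, the same handler can be narrowed
per request by passing qf=text_fr (for example) as a request parameter.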

If you skip stemming you will in many cases still be able to build a great search,
but you may be better off trying to guess the input language by means of IP detection,
browser headers, statistical analysis or simply asking the user.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 1 March 2013 at 23:47, vybe3142 <vy...@gmail.com> wrote:

> From your response, I gather that there's no way to maintain a single set of
> fields for multiple languages i.e. I can't use a field "text" for the body
> text. Instead, I would have to define text_en, text_fr, text_ru etc each
> mapped to their specific languages.

Re: Language Identification and Stemming

Posted by Otis Gospodnetic <ot...@gmail.com>.

Yes, you would want fields for different languages to be separate.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

Re: Language Identification and Stemming

Posted by vybe3142 <vy...@gmail.com>.

From your response, I gather that there's no way to maintain a single set of
fields for multiple languages, i.e. I can't use a field "text" for the body
text. Instead, I would have to define text_en, text_fr, text_ru etc., each
mapped to its specific language.



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-Identification-and-Stemming-tp4044116p4044132.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Language Identification and Stemming

Posted by vybe3142 <vy...@gmail.com>.

Thank You



--
View this message in context: http://lucene.472066.n3.nabble.com/Language-Identification-and-Stemming-tp4044116p4044129.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Language Identification and Stemming

Posted by Jan Høydahl <ja...@cominvent.com>.

Hi,

Q1. You use langid for the detection, and your chosen field(s) can be mapped to
new per-language names such as title -> title_en or title_de. The stemmer that
runs is the one configured on the mapped field's type, so you need to configure
your schema with a separate fieldType for every language you want to support
if you'd like language-specific stemming, stopwords, etc.
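
A minimal sketch of what that could look like in schema.xml for English and
German follows. The stopword file paths follow the layout of the Solr example
configuration and are assumptions, as is the particular choice of stemmers.

```xml
<!-- schema.xml fragment (sketch): one fieldType per supported language.
     Stopword file locations are assumed; adjust to your conf directory. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="lang/stopwords_de.txt" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.GermanLightStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Targets of the langid mapping: title -> title_en or title_de -->
<field name="title_en" type="text_en" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>
```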

Q2. You set up update.chain in your request handler and that's it.
It is not possible to return the detected language, or any other response
from the UpdateProcessors, to the client. You'll need to fetch the indexed
document.
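
For illustration, the wiring might look roughly like this in solrconfig.xml.
The chain name "langid", the language field language_s, and the field and
language lists are assumptions. It also assumes the custom handler resolves
the chain from the update.chain parameter the way the stock update handlers do.

```xml
<!-- solrconfig.xml fragment (sketch): chain name and field names are assumed -->
<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,text</str>            <!-- fields to detect on -->
    <str name="langid.langField">language_s</str>     <!-- where the code is stored -->
    <bool name="langid.map">true</bool>               <!-- title -> title_en etc. -->
    <str name="langid.whitelist">en,de</str>          <!-- languages the schema supports -->
    <str name="langid.fallback">general</str>         <!-- used when detection fails -->
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- The handler from the question, pointed at the chain by default -->
<requestHandler name="/update/myupdatet"
                class="com.xyz.MyDocUpdateHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>
```

With this setup the detected code ends up in language_s, so after a commit the
client could read it back with an ordinary query for the document's id and
fl=language_s.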

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com