Posted to solr-user@lucene.apache.org by "Riedl, Johannes" <jo...@uni-tuebingen.de> on 2016/06/05 20:57:10 UTC

Multilingual Solr

Hi all,

we are looking for a way to switch between languages in the query results while keeping the ability to search several languages in parallel. The overall aim is a constant field name plus an additional Solr parameter "lang=XX_YY" that returns the results in the chosen language while searches are applied to all languages. Setting up several cores to obtain a generic field name is not an option. Does anyone know of a clean way to achieve this, in particular routing content indexed to a generic field (e.g. title) to a "background field" (e.g. title_en, title_fr) on the fly and retrieving it from there depending on the language chosen?

Background: So far, we have investigated the multi-language field approach offered by Trey Grainger in the code examples for "Solr in Action" (https://github.com/treygrainger/solr-in-action.git, chapter 14). It is an extension of the ordinary textField that allows a generic field name to be used: the language is encoded at the beginning of the field content, and the appropriate index and query analyzers are associated with dummy fields in schema.xml. If there is a way to store data in these dummy fields, and the lang parameter can be added on top, we might be done.
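
To make the prefix idea concrete, here is a minimal Python sketch of encoding the language at the beginning of the field content and splitting it off again. The function names and the "|" and "," delimiters are assumptions chosen for illustration; they are not necessarily the convention used by the chapter 14 code.

```python
def encode_multilang(languages, text, key_delim="|", multi_delim=","):
    """Prefix text with its language codes, e.g. 'en,fr|some text'.

    The delimiters are illustrative assumptions, not a documented format.
    """
    if not languages:
        return text
    return multi_delim.join(languages) + key_delim + text


def decode_multilang(value, default_lang="en", key_delim="|", multi_delim=","):
    """Split a prefixed value into (languages, text).

    Values without a prefix fall back to default_lang. Note that unprefixed
    text which itself contains the delimiter would be misread as prefixed.
    """
    prefix, sep, rest = value.partition(key_delim)
    if not sep:
        return [default_lang], value
    return prefix.split(multi_delim), rest
```

A value such as "en,fr|hello" would then be analyzed with both the English and the French analyzer, while the field name seen by the client stays the same.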

Thanks a lot, best regards

Johannes

Re: Multilingual Solr

Posted by Johannes Riedl <jo...@uni-tuebingen.de>.
Hi Alessandro, hi Alexandre,

Thanks a lot for your replies, considerations and hints. We use a 
web front end that comes bundled with Solr. It currently uses a single 
core approach. We would like to stick to the original setup as closely 
as possible to avoid administrative overhead and to not prevent the 
possible use of several cores in a different context in the future. This 
is the reason why we would like to hide the language fields completely 
from the front end apart from specifying an additional language 
parameter. Language detection on indexing is currently not an issue for 
us, as we get the input in a standardized format and thus can determine 
the language beforehand.

https://github.com/treygrainger/solr-in-action/blob/master/example-docs/ch14/cores/multi-language-field/conf/schema.xml 
shows an example of how the multiText field type makes use of 
language-specific field types to specify the analyzers that are being 
used. The core issue for us (pun intended ;-)) is to find out whether it 
is possible to extend this approach to return only the selected 
language(s), i.e. to transparently add something like nested documents.
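
The "return only the selected language" part could also be approximated outside Solr, in a thin layer between Solr and the front end. The following Python sketch is only an illustration under that assumption: the function name, the "title" field and the "<field>_<lang>" suffix convention are hypothetical, and a server-side solution would look different.

```python
def project_language(doc, lang, generic_fields=("title",)):
    """Return a copy of a result document that exposes only the chosen
    language, renamed to the generic field name.

    Assumes language variants are named '<field>_<lang>'. Any other field
    that happens to start with '<field>_' is dropped as well.
    """
    result = {}
    for key, value in doc.items():
        for name in generic_fields:
            prefix = name + "_"
            if key.startswith(prefix):
                # Language variant: keep it only for the requested language.
                if key[len(prefix):] == lang:
                    result[name] = value
                break
        else:
            # Not a language variant: pass the field through unchanged.
            result[key] = value
    return result
```

With a document containing title_en and title_fr, calling project_language(doc, "fr") leaves a single "title" field holding the French value, so the front end never sees the suffixed fields.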

Best regards

Johannes


On 06.06.2016 10:10, Alessandro Benedetti wrote:
> [...]


Re: Multilingual Solr

Posted by Alessandro Benedetti <ab...@apache.org>.
Hi Johannes,
there is nothing out of the box, unfortunately, but it could be a nice idea
and contribution.
If a multi-core setup is not an option (out of curiosity, can I ask why?),
you could proceed in this way:

1) In the schema, you define N variations of each field you are interested
in, where N is the number of languages you support.
For the text field, for example, you define:
text     not indexed, only stored
text_en  indexed
text_fr  indexed
text_it  indexed ...

2) At indexing time, you can develop a custom UpdateRequestProcessor that
identifies the language (Solr's internal libraries offer support for that)
and routes the content to the correct text field for indexing.
If you also want to index translations, you need to rely on third-party
libraries for that.

3) At query time, you can search all the fields you want in parallel, for
example with the edismax query parser.

4) For rendering the results, it is not entirely clear to me what you want.
Do you want to:

a) translate the document content into the language you want? You could
develop a custom DocTransformer that takes the language as input and
translates, but I don't see much benefit in that.

b) return only the documents that were originally in that language? This
case is easy: you add an fq at query time to keep only the documents of the
language you want (the language is identified at indexing time).

c) return the original content of the document? This is quite easy: you
store the generic "text" field and always return it.
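
The steps above can be sketched in Python as plain data structures that a client would send to Solr. This is a rough sketch under several assumptions: the routing is done on the client as a stand-in for the custom UpdateRequestProcessor, the field types "string" and "text_<lang>" are assumed to exist in the schema, and the "language" field used for filtering is a hypothetical name.

```python
def schema_commands(field="text", languages=("en", "fr", "it")):
    """Step 1: Schema API 'add-field' commands, one request body per dict.

    The generic field is stored only; each language variant is indexed only.
    The field type names are assumptions about the target schema.
    """
    commands = [{"add-field": {"name": field, "type": "string",
                               "indexed": False, "stored": True}}]
    for lang in languages:
        commands.append({"add-field": {"name": f"{field}_{lang}",
                                       "type": f"text_{lang}",
                                       "indexed": True, "stored": False}})
    return commands


def route_document(doc, lang, field="text"):
    """Step 2 (client-side stand-in): copy the generic field into the
    variant for the known language and record the language for filtering."""
    routed = dict(doc)
    if field in doc:
        routed[f"{field}_{lang}"] = doc[field]
    routed["language"] = lang  # hypothetical field name
    return routed


def search_params(user_query, languages=("en", "fr", "it"),
                  field="text", only_lang=None):
    """Steps 3 and 4: search all language variants with edismax, return
    the stored generic field, and optionally filter by original language."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": " ".join(f"{field}_{lang}" for lang in languages),
        "fl": f"id,{field}",
    }
    if only_lang is not None:
        params["fq"] = f"language:{only_lang}"
    return params
```

For example, search_params("bonjour", only_lang="fr") searches text_en, text_fr and text_it, but only returns documents whose language was recorded as French.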

Let us know for further discussion,

Cheers





Re: Multilingual Solr

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
There are language auto-detect UpdateRequestProcessors that route
indexed content to differently suffixed fields. You can use the
LangDetect-based one: http://www.solr-start.com/info/update-request-processors/#LangDetectLanguageIdentifierUpdateProcessorFactory
or the Tika-based one: http://www.solr-start.com/info/update-request-processors/#TikaLanguageIdentifierUpdateProcessorFactory

To map during retrieval, you could use aliases, like I did in my book
example some years ago:
https://github.com/arafalov/solr-indexing-book/blob/master/published/languages/conf/solrconfig.xml#L20
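
As a rough illustration of aliasing (an independent sketch, not a copy of the linked solrconfig.xml), the Python function below builds request parameters that map a generic name to language-specific fields. It relies on two Solr features: renaming a field in fl ("title:title_fr") and edismax per-field aliases ("f.title.qf"). The function name, the "title" field and the language list are assumptions, and the suffixed fields are assumed to be stored.

```python
def alias_params(user_query, lang, generic_fields=("title",),
                 languages=("en", "fr")):
    """Build edismax parameters that hide the language suffixes."""
    params = {"q": user_query, "defType": "edismax"}
    # Output alias: fl=title:title_fr returns title_fr under the name "title".
    params["fl"] = ",".join(
        ["id"] + [f"{name}:{name}_{lang}" for name in generic_fields])
    # Unfielded terms are searched in every language variant.
    params["qf"] = " ".join(
        f"{name}_{code}" for name in generic_fields for code in languages)
    # Query alias: an explicit "title:foo" is expanded to all variants.
    for name in generic_fields:
        params[f"f.{name}.qf"] = " ".join(
            f"{name}_{code}" for code in languages)
    return params
```

Calling alias_params("title:maison", "fr") searches both title_en and title_fr, and returns the French title under the plain name "title".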

Does this cover your needs?

Regards,
   Alex.

