Posted to solr-user@lucene.apache.org by benjelloun <an...@gmail.com> on 2014/07/04 16:52:37 UTC

multilingual search

Hello,

I need to detect the language of my fields at index time. Then, when I search
with the "/select" request handler, how can I make the search detect the
language of the query words and choose which field_langid to use?

My configuration:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <bool name="langid">true</bool>
      <str name="langid.fl">NomDocument,ContenuDocument,Postit</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.whitelist">en,fr,ar</str>
      <str name="langid.fallback">fr</str>
      <float name="langid.threshold">0.6</float>
      <bool name="langid.map">true</bool>
      <bool name="langid.map.individual">true</bool>
      <bool name="langid.map.keepOrig">true</bool>
    </lst>
  </processor>
  <!-- any further processors of the chain are not shown in this excerpt -->
</updateRequestProcessorChain>

<field name="AllChamp_ar" type="text_ar" multiValued="true" indexed="true"
       required="false" stored="false"/>
<field name="AllChamp_fr" type="text_fr" multiValued="true" indexed="true"
       required="false" stored="false"/>
<field name="AllChamp_en" type="text_en" multiValued="true" indexed="true"
       required="false" stored="false"/>

<dynamicField name="*_en" type="text_en" indexed="true" stored="false"
              required="false" multiValued="true"/>
<dynamicField name="*_fr" type="text_fr" indexed="true" stored="false"
              required="false" multiValued="true"/>
<dynamicField name="*_ar" type="text_ar" indexed="true" stored="false"
              required="false" multiValued="true"/>

<copyField source="*_ar" dest="AllChamp_ar"/>
<copyField source="*_fr" dest="AllChamp_fr"/>
<copyField source="*_en" dest="AllChamp_en"/>

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="defType">edismax</str>
    <str name="qf">AllChamp^2.0 AllChamp_ar^2.0 AllChamp_en^2.0 AllChamp_fr^5.0</str>
  </lst>
</requestHandler>

Example of a search in the Solr Admin UI: "nous présentons". This is French,
and "nous" is in stopwords_fr.
But when I search for "nous présentons", I get matches on "nous", because I
have some English documents that contain "nous".

This is just one example for one language. I don't want to add the
stopwords_fr entries to stopwords_en.
What I want is to detect the language of the query before the select search
runs, and then choose the matching field_langid for the search.

Best regards,
Anass BENJELLOUN

--
View this message in context: http://lucene.472066.n3.nabble.com/multilingual-search-tp4145639.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: multilingual search

Posted by Paul Libbrecht <pa...@hoplahup.net>.
> 1. Modify the qf parameter directly by either adding the "_xx" language suffix to each field in qf, or replacing the "xx" for any qf fields that already have an "_xx" suffix.
> 2. Have separate "qf_xx" parameters which are customized for specific languages and then copy the language-specific "qf_xx" parameter to the main qf parameter based on the language that is detected.

What I do is a little more subtle than either of these.
I remove the stock query component and call it from my own component, which wraps it.
Any term that is not given with a named field gets the multilingual expansion.
To do this, I call the query parser with a "wild" default field, then perform the expansion on the parsed query: every term query on the wild field becomes a disjunction over the detected languages of interest, with the term analyzed separately for each language.

The same mechanism also covers exact/stemmed/phonetic fields and lets you, for example, prefer a match in the title over a match in the body.
This assumes, of course, that these fields exist for each language (and that Metaphone works for, say, German, for which there is no evidence yet).
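
To make the expansion step concrete, here is a rough sketch in Python rather than as an actual Solr component. It assumes the parsed query has already been flattened into (field, term) pairs; the placeholder name "_wild_", the helper names, and the use of AllChamp_xx as target fields are assumptions for illustration, and terms are assumed to be plain words that need no escaping. A real component would rewrite Lucene Query objects instead of building strings.

```python
WILD_FIELD = "_wild_"  # hypothetical placeholder used as the parser's default field


def expand_clause(field: str, term: str, languages: list,
                  base_field: str = "AllChamp") -> str:
    """Expand a clause on the placeholder field into a per-language disjunction."""
    if field != WILD_FIELD:
        # Explicitly named fields are left untouched.
        return f"{field}:{term}"
    alternatives = [f"{base_field}_{lang}:{term}" for lang in languages]
    return "(" + " OR ".join(alternatives) + ")"


def expand_query(clauses: list, languages: list) -> str:
    """Join the expanded clauses; each (field, term) pair is one parsed term query."""
    return " ".join(expand_clause(f, t, languages) for f, t in clauses)


expanded = expand_query(
    [(WILD_FIELD, "présentons"), ("language_s", "fr")],
    ["fr", "en"],
)
print(expanded)
```

Because each alternative names a language-specific field, the term ends up analyzed with that field's own analyzer, which is the point of the expansion.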

paul

Re: multilingual search

Posted by Jack Krupansky <ja...@basetechnology.com>.
Indeed, a Solr search component to customize the incoming query for query 
language can work as well. Add it to the search components before the 
"query" component, have it call the language detection code on the q 
parameter, and then modify the "qf" parameter based on the language 
discovered.

Two possible approaches come to mind:

1. Modify the qf parameter directly by either adding the "_xx" language 
suffix to each field in qf, or replacing the "xx" for any qf fields that 
already have an "_xx" suffix.
2. Have separate "qf_xx" parameters which are customized for specific 
languages and then copy the language-specific "qf_xx" parameter to the main 
qf parameter based on the language that is detected.
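
The two approaches can be sketched as plain parameter manipulation. This is Python for brevity, not a real Solr search component (which would be written in Java and operate on the request parameters); the function names are made up for the example, and the suffix pattern only knows the three languages from the langid.whitelist in the original post.

```python
import re

# Matches an existing language suffix; limited to the whitelisted languages.
LANG_SUFFIX = re.compile(r"_(?:en|fr|ar)$")


def retarget_qf(qf: str, lang: str) -> str:
    """Approach 1: point every qf field at its "_<lang>" variant, keeping boosts."""
    seen = {}
    for entry in qf.split():
        field, sep, boost = entry.partition("^")
        target = LANG_SUFFIX.sub("", field) + "_" + lang
        # Keep only the first entry if several fields collapse to the same name.
        seen.setdefault(target, f"{target}{sep}{boost}")
    return " ".join(seen.values())


def select_qf(params: dict, lang: str) -> dict:
    """Approach 2: copy the "qf_<lang>" parameter, if present, over "qf"."""
    result = dict(params)
    specific = params.get(f"qf_{lang}")
    if specific is not None:
        result["qf"] = specific
    return result


print(retarget_qf("NomDocument^2.0 ContenuDocument_en", "fr"))
print(select_qf({"qf": "AllChamp", "qf_fr": "AllChamp_fr^5.0"}, "fr")["qf"])
```

Approach 2 keeps the per-language tuning (fields and boosts) in configuration, while approach 1 needs no extra parameters but applies the same boosts to every language.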

-- Jack Krupansky



Re: multilingual search

Posted by Paul Libbrecht <pa...@hoplahup.net>.
To do just what Jack described, I often write a Solr query component that does "query expansion".
Based on parameters that I can recognize as language hints (e.g. the language of the environment the user searches in, or the browser's Accept-Language header), I reformulate the query into a query over the fields for these languages, in order of preference.
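
As an illustration of turning such a hint into a preference order, here is a small Python sketch. It is not the component itself: the function names and the boost scheme (highest boost for the most preferred language) are assumptions, and the supported languages and the AllChamp_xx fields are taken from the original post.

```python
SUPPORTED = ("en", "fr", "ar")  # from langid.whitelist in the original post


def preferred_languages(accept_language: str) -> list:
    """Return supported language codes from an Accept-Language header, best first."""
    weighted = []
    for position, part in enumerate(accept_language.split(",")):
        tag, _, params = part.strip().partition(";")
        lang = tag.split("-")[0].lower()
        weight = 1.0  # a missing q-value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0
        if lang in SUPPORTED and weight > 0:
            weighted.append((-weight, position, lang))
    ordered = []
    for _, _, lang in sorted(weighted):
        if lang not in ordered:
            ordered.append(lang)
    return ordered


def qf_for(languages: list, base_field: str = "AllChamp") -> str:
    """Build a qf string whose boosts decrease along the preference order."""
    top = len(languages)
    return " ".join(
        f"{base_field}_{lang}^{top - i}.0" for i, lang in enumerate(languages)
    )


langs = preferred_languages("fr-FR,fr;q=0.9,en;q=0.8,de;q=0.5")
print(langs, qf_for(langs))
```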

I am sure that doing this produces some noise, for example because the search corpus is not uniformly distributed, but… I have to accept it.

There are many other examples besides Jack's fine "raison d'être" example (I particularly like the way he describes the motivation for using it, I can almost hear people trying to carefully articulate this! ;-)).
Other examples of cross-language use include the "gallicisms", e.g. in German: http://de.wikipedia.org/wiki/Liste_von_Gallizismen or the other languages linked there.

E.g. "direction", which has different meanings in French (where it can mean the management staff) and in English (where it can mean the teacher's instruction); "demonstration" too; and "sitting" (which is an English word used in French).


paul



Re: multilingual search

Posted by Jack Krupansky <ja...@basetechnology.com>.
What leads you to believe that the user is not interested in occurrences of 
the French phrase in English text? I mean, we English-speakers and writers 
like to use French phrases to show how sophisticated we are! It's part of 
our... raison d'être. If I do a Google search for "raison d'être", it 
doesn't mysteriously show me only French documents.

So, usually, it needs to be a user preference - the user's preferred 
language, and whether they want to search across documents in all languages 
or just a subset of languages. And then, on the results page you can show 
the language and a button to restrict a re-query to the specific language.

If you really need to do this query language detection, the best approach is 
to do it within your application layer (you can use the Google code for 
language detection) and then send the query to the appropriate query request 
handler, with a separate query request handler for each language that 
optimizes the settings for that language, such as the language-specific 
fields to use for the "qf" parameter.
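
A minimal sketch of that application-layer routing, in Python. The handler names (/select_en, /select_fr, /select_ar), the base URL, and the detector function are assumptions for illustration; each handler would have to be declared in solrconfig.xml with its own "qf", and any language detector that returns a two-letter code could be plugged in.

```python
from typing import Callable
from urllib.parse import urlencode

# Hypothetical per-language request handlers; each would need to be declared
# in solrconfig.xml with a language-specific "qf" default.
HANDLERS = {"en": "/select_en", "fr": "/select_fr", "ar": "/select_ar"}
FALLBACK_HANDLER = "/select"  # searches all languages, as in the original setup


def build_search_url(base_url: str, query: str,
                     detect_language: Callable[[str], str]) -> str:
    """Route the query to the request handler for its detected language."""
    lang = detect_language(query)
    handler = HANDLERS.get(lang, FALLBACK_HANDLER)
    return f"{base_url}{handler}?{urlencode({'q': query})}"


# Stand-in detector for the example only; a real application would call a
# language-detection library here.
def toy_detector(text: str) -> str:
    return "fr" if "nous" in text.lower().split() else "unknown"


url = build_search_url("http://localhost:8983/solr/collection1",
                       "nous présentons", toy_detector)
print(url)
```

Queries whose language is not recognized fall back to the general handler, so nothing is lost when detection fails on short queries.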

-- Jack Krupansky
