Posted to dev@lucene.apache.org by Will Martin <wi...@gmail.com> on 2017/01/24 06:37:56 UTC

Language Detection Individual Field Mapping Bug

Hello,

While using Solr 6.0.4 I noticed that the
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
has a bug in it where it does not respect the "langid.map.individual"
parameter in solrconfig.xml. The documentation for langid.map.individual
<https://wiki.apache.org/solr/LanguageDetection#langid.map.individual>
specifies:

> If you require detecting languages separately for each field, supply
> langid.map.individual=true. The supplied fields will then be renamed
> according to detected language on an individual field basis.

However, when this parameter is set to "true", the fields are still mapped to
the language code of the entire document. For example, with the following
snippet from solrconfig.xml

<processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
   <lst name="defaults">
     <str name="langid.fl">title,text</str>
     <str name="langid.langField">language_s</str>
     <bool name="langid.map">true</bool>
     <bool name="langid.map.individual">true</bool>
   </lst>
</processor>

a document that takes the form

{
  "title": "This is an English title",
  "text": "Pero el texto de este documento está en español."
}

will be turned into

{
  "title_es": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es"]
}

rather than

{
  "title_en": "This is an English title",
  "text_es": "Pero el texto de este documento está en español.",
  "language_s": ["es","en"]
}

during processing.

This bug seems to have been introduced in SOLR-3881
<https://issues.apache.org/jira/browse/SOLR-3881> when the abstract method
(LangDetectLanguageIdentifierUpdateProcessor.java:52)

protected List<DetectedLanguage> detectLanguage(String content)

was changed to the signature

protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)

which does not allow one to recognize individual fields while performing
language detection. As it stands, the entire document is analysed per
individual field (included in the "langid.fl" or "langid.map.individual.fl"
parameters) and the field is mapped to the language of the entire document.
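
To make the difference concrete, here is a minimal, self-contained sketch.
It is not Solr code: the class LangMapSketch, its toy stop-word detector, and
the helpers mapByDocument and mapByField are all hypothetical names made up
for this illustration. mapByDocument mimics the behaviour described above
(one detection over the whole document, applied to every field), while
mapByField shows what langid.map.individual=true is documented to do.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class LangMapSketch {
    // Toy stop-word lists standing in for a real detector (assumption, illustration only).
    static final Map<String, Set<String>> STOP_WORDS = Map.of(
        "en", Set.of("this", "is", "an", "the", "of"),
        "es", Set.of("pero", "el", "de", "este", "en"));

    // Returns the language whose stop words occur most often, or "unknown" if none occur.
    static String detect(String content) {
        String[] tokens = content.toLowerCase(Locale.ROOT).split("\\P{L}+");
        String best = "unknown";
        int bestHits = 0;
        for (String lang : List.of("en", "es")) { // fixed order keeps the result deterministic
            int hits = 0;
            for (String token : tokens) {
                if (STOP_WORDS.get(lang).contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                best = lang;
                bestHits = hits;
            }
        }
        return best;
    }

    // Reported behaviour: detect once over all field values, rename every field with that code.
    static Map<String, String> mapByDocument(Map<String, String> doc) {
        String lang = detect(String.join(" ", doc.values()));
        Map<String, String> out = new LinkedHashMap<>();
        doc.forEach((name, value) -> out.put(name + "_" + lang, value));
        return out;
    }

    // Documented behaviour: detect each field's own content and rename it accordingly.
    static Map<String, String> mapByField(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        doc.forEach((name, value) -> out.put(name + "_" + detect(value), value));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        // Unicode escapes keep this source file ASCII-only.
        doc.put("text", "Pero el texto de este documento est\u00e1 en espa\u00f1ol.");
        System.out.println(mapByDocument(doc).keySet()); // every field gets the document-level code
        System.out.println(mapByField(doc).keySet());    // each field gets its own code
    }
}
```

The only point of the sketch is that the detector has to be handed the
content of a single field for per-field mapping to work; a signature that
only receives the whole document cannot tell the fields apart.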

I searched the Apache Jira for a ticket tracking this bug but did not find
anything that seemed related. I thought before filing a new ticket I would
ping this mailing list to see if anyone knows about work relating to this
issue or if there is already a ticket for it (not directly related to the
term "langid.map.individual" perhaps). If not I can go ahead and file the
ticket.


Thanks,

-William Martin

Re: Language Detection Individual Field Mapping Bug

Posted by Tomás Fernández Löbbe <to...@gmail.com>.

Thanks Will,
This does look like a bug and I also couldn't find a Jira issue for it.
Feel free to create one.

Tomás
