Posted to solr-user@lucene.apache.org by Müller, Stephan <Mu...@ponton-consulting.de> on 2013/11/27 11:02:59 UTC

LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

Hello,

This is a repost. The message was originally posted on the 'general' list, but it was suggested that the 'user' list might be a better place to ask.

---- Original Message ----
Hi,

We are passing a multivalued field to the LanguageIdentifierUpdateProcessor. This multivalued field contains values of arbitrary types (Integer, String, Date).

Now, LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument doc, String[] fields), which by the way does not use its fields parameter, is unable to process all values of a multivalued field. The call "Object content = doc.getFieldValue(fieldName);" does not care what kind of field it is dealing with and just delegates to SolrInputDocument, which in turn calls getFirstValue.

So, two issues:
First - if the first value of the multivalued field is not of type String, the field is ignored completely.

Second - the concat method does not concatenate all values of a multivalued field.
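
To make the two issues concrete, here is a minimal sketch in plain Java that models the behavior described above. It is not the Solr source: the class FirstValueDemo and the method firstValueOnly are hypothetical names, and a plain List<Object> stands in for the values of the multivalued field.

```java
import java.util.Arrays;
import java.util.List;

public class FirstValueDemo {
    // Models the reported behavior (an assumption based on the description
    // above, not a copy of the Solr code): only the first value is looked at,
    // and it is used only if it happens to be a String.
    public static String firstValueOnly(List<Object> values) {
        if (values == null || values.isEmpty()) {
            return "";
        }
        Object first = values.get(0);
        return (first instanceof String) ? (String) first : "";
    }

    public static void main(String[] args) {
        List<Object> mixed = Arrays.<Object>asList(42, "Hallo Welt", "Guten Tag");
        List<Object> strings = Arrays.<Object>asList("Hallo Welt", "Guten Tag");

        // Issue 1: the first value is an Integer, so no text is used at all.
        System.out.println("[" + firstValueOnly(mixed) + "]");   // prints "[]"
        // Issue 2: only the first String is used, the second one is dropped.
        System.out.println("[" + firstValueOnly(strings) + "]"); // prints "[Hallo Welt]"
    }
}
```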

This contradicts http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html, which states: "The feature is designed to detect exactly one language per field. In case of multivalued, it will concatenate all values before detection." As far as I can see, the code is unable to do this at all for multivalued fields.

This behavior was found in 4.3, but the code is still the same in current trunk (as of 2013-11-26).

Is this a bug? Is this a deliberate design decision? Did we miss some configuration that would allow the language identification to use all values of a multivalued field?

We are about to write our own LangDetectLanguageIdentifierUpdateProcessorFactory (why is getInstance hardcoded to return LanguageIdentifierUpdateProcessor?) and override LanguageIdentifierUpdateProcessor to handle all values of a multivalued field, ignoring non-string values.
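
The core of such an override could look roughly like the following sketch. It is plain Java without any Solr dependency; the class ConcatAllValues and the method concatStringValues are hypothetical names, and joining the values with a single space is an assumption. In Solr itself the values would presumably come from SolrInputDocument.getFieldValues(fieldName), which returns a Collection<Object>; that wiring is not shown here.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Date;

public class ConcatAllValues {
    // Concatenates every String value of a (possibly multivalued) field and
    // silently skips values of any other type. The space separator is an
    // assumption made for this sketch.
    public static String concatStringValues(Collection<?> values) {
        StringBuilder sb = new StringBuilder();
        if (values == null) {
            return "";
        }
        for (Object value : values) {
            if (value instanceof String) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Collection<Object> values =
                Arrays.<Object>asList(42, "Hallo Welt", new Date(0), "Guten Tag");
        // The Integer and the Date are ignored, both Strings are kept.
        System.out.println(concatStringValues(values)); // prints "Hallo Welt Guten Tag"
    }
}
```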



Please see configuration below.

I hope I was able to make myself clear. I'd like to hear your thoughts on this before I go off and file a bug report.

Regards,
Stephan


A little background:
We are using a 3rd-party CMS framework which pulls in some magic Solr configuration (namely the textbody field).

The textbody field is defined as follows:
<!--
The default text search field.
This field and the field name_tokenized are used as default search fields
for the /editor and /cmdismax search request handlers in solrconfig.xml.

For the Content Feeder the text of all indexed fields of
the CoreMedia document is stored in this field.
The CAE Feeder by default stores the text of all elements in
this field.
-->
<field name="textbody" type="text_general" stored="false" multiValued="true"/>

As you can see, it is also used as a search field; therefore, we want to keep the actual data types of the values.
The field itself is generated by a processor before the language identification is called (see processor chain).



The processor chain:

<updateRequestProcessorChain>
  <!-- Improve error messages -->
  <processor class="3rdpartypackage.ErrorHandlingProcessorFactory" />

  <!-- Blob extraction -->
  <processor class="3rdpartypackage.BinaryDataProcessorFactory">
    <!-- some comments -->
  </processor>

  <!-- Textbody handling -->
  <processor class="3rdpartypackage.TextBodyProcessorFactory" />

  <!-- Copy content of field name to name_tokenized -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">name</str>
    <str name="dest">name_tokenized</str>
  </processor>

  <!-- Language detection -->
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">textbody,name_tokenized</str>
    <str name="langid.langField">language</str>
    <str name="langid.fallback">en</str>
  </processor>
  
  <!-- Index into language dependent fields if defined (e.g. textbody_en instead of textbody) -->
  <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
    <str name="languageField">language</str>
    <str name="textFields">textbody,name_tokenized</str>
  </processor>

  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

Posted by Jack Krupansky <ja...@basetechnology.com>.

I suspect that it is an oversight for a use case that was not considered. I 
mean, it should probably either ignore or convert non text/string values. 
Hmmm... are you using JSON input? I mean, how are the types being set? Solr 
XML doesn't have a way to set the value types.

You could work around it with an update processor that copies the field and
massages the multiple values into what you really want the language
detection to see. You could even implement that processor as a JavaScript
script with the stateless script update processor.
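
The "copy and massage" step suggested above might be sketched as follows. This is plain Java, not a real update processor: a Map<String, List<Object>> stands in for the Solr document, the class LangIdHelperField is a hypothetical name, and the helper field name textbody_langid is invented for illustration. Non-string values are simply ignored here; converting them would be the other option mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LangIdHelperField {
    // Hypothetical name of the helper field that language detection would read.
    public static final String HELPER_FIELD = "textbody_langid";

    // Copies all String values of sourceField into one concatenated value of
    // the helper field. The source field itself is left untouched, so its
    // typed values stay available for indexing.
    public static void addHelperField(Map<String, List<Object>> doc, String sourceField) {
        List<Object> values = doc.get(sourceField);
        if (values == null) {
            return;
        }
        StringBuilder text = new StringBuilder();
        for (Object value : values) {
            if (value instanceof String) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append((String) value);
            }
        }
        if (text.length() > 0) {
            List<Object> single = new ArrayList<>();
            single.add(text.toString());
            doc.put(HELPER_FIELD, single);
        }
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        doc.put("textbody", Arrays.<Object>asList(7, "Hallo Welt", "Guten Tag"));
        addHelperField(doc, "textbody");
        System.out.println(doc.get(HELPER_FIELD)); // prints "[Hallo Welt Guten Tag]"
    }
}
```

With a step like this in front of the language detection, langid.fl would list the helper field instead of textbody.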

-- Jack Krupansky
