Posted to solr-user@lucene.apache.org by Trey Grainger <so...@gmail.com> on 2013/12/13 00:06:59 UTC

Re: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

Hmm... haven't run into the case where null was returned in a multi-valued
scenario yet... I probably just haven't tested that case.  I likely need to
add a null check there - thanks for pointing it out.
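
For illustration, here is a minimal sketch of the kind of guard I have in
mind. It is not the code from the book example or from Solr itself:
MultiValueText and concatStrings are hypothetical names, and a plain
Collection stands in for whatever SolrInputField.getValues() returns, so
the snippet compiles without Solr on the classpath. It assumes we want to
concatenate only the String values and skip everything else.

```java
import java.util.Collection;

// Hypothetical helper, not part of Solr: it only illustrates the null guard
// and the "use all String values" idea discussed in this thread.
final class MultiValueText {
    private MultiValueText() {}

    /**
     * Joins the String entries of a multivalued field with single spaces.
     * 'values' stands in for the result of SolrInputField.getValues(),
     * which may be null when the field's value is null.
     */
    static String concatStrings(Collection<Object> values) {
        if (values == null) {
            return "";                        // guard against the NPE
        }
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            if (value instanceof String) {    // skips null entries and non-String types
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append((String) value);
            }
        }
        return sb.toString();
    }
}
```

The joined text could then be handed to the language detector once,
instead of detecting on the first value only.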

-Trey


On Fri, Nov 29, 2013 at 6:10 AM, Müller, Stephan <
Mueller@ponton-consulting.de> wrote:

> Hello Trey, thank you for this example.
>
> We've solved it by omitting the multivalued field and passing the distinct
> string fields instead. I still intend to propose a patch, so that the
> language processor is able to concatenate multivalues by default. I think
> it's a reasonable feature (and I can't remember ever having contributed a
> patch to an open source project).
> My thoughts on the patch implementation are much the same as yours:
> iterating over getValues(). I'll have this discussed on the dev list and
> probably in JIRA.
>
>
> One thing: How do you guard against a possible NPE in line 129
> > (final Object inputValue : inputField.getValues()) {
>
> SolrInputField.getValues() will return NULL if the associated value was
> null. It does not create an empty Collection.
> That, btw, seems to be a minor bug in the javadoc, not stating that this
> method returns null.
>
>
> Regards,
> Stephan - srm
>
> [...]
>
> > The "langsToPrepend" variable above will contain a set of languages,
> > where detectLanguage was called separately for each value in the
> > multivalued field.  If you just want to concatenate all the values and
> > detect languages once (as opposed to only using the first value in the
> > multivalued field, like it does today), just concatenate each of the
> > input values in the first loop and call detectLanguage once at the end.
> >
> > I wrote code that does this for an example in the Solr in Action book.
> >  The particular example was detecting languages for each value in a
> > multivalued field and then pre-pending the language to the text for the
> > multivalued field (so the analyzer would know which stemmer to use, as
> > they were being dynamically substituted in based upon the language).  The
> > code is available here if you are interested:
> > https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java
> >
> > Good luck!
> >
> > -Trey
> >
> >
> >
> >
> > On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan <
> > Mueller@ponton-consulting.de> wrote:
> >
> > > > I suspect that it is an oversight for a use case that was not
> > > > considered. I mean, it should probably either ignore or convert
> > > > non text/string values.
> > > Ok, I'll see that I provide a patch against trunk. It actually ignores
> > > non string values, but is unable to check the remaining values of a
> > > multivalued field.
> > >
> > > > Hmmm... are you using JSON input? I mean, how are the types being
> > > > set? Solr XML doesn't have a way to set the value types.
> > > >
> > > No. It's a field with multivalued=true. That results in a
> > > SolrInputField where value (which is defined to be Object) actually
> > > holds a List.
> > > This list is populated with Integer, String, Date, you name it.
> > > I'm talking about the actual Java-Datatypes. The values in the list
> > > are probably set by this 3rdparty Textbodyprocessor thingy.
> > >
> > > Now the Language processor just asks for field.getValue().
> > > This is delegated to the SolrInputField, which in turn calls
> > > firstValue(). Interestingly enough, that method is already able to
> > > handle a Collection as its value, but if the value is a collection,
> > > it just returns the first element.
> > >
> > > > You could work around it with an update processor that copied the
> > > > field and massaged the multiple values into what you really want the
> > > > language detection to see. You could even implement that processor
> > > > as a JavaScript script with the stateless script update processor.
> > > >
> > > Our workaround would be to not feed the multivalued field but only the
> > > String fields (which are also included in the multivalued field)
> > >
> > >
> > > Filing a bug/feature request and providing the patch will take some
> > > time as I haven't set up a fully working trunk in my IDEA installation.
> > > But I'm eager to do it :)
> > >
> > > Regards,
> > > Stephan
> > >
> > >
> > > > -- Jack Krupansky
> > > >
> > > > -----Original Message-----
> > > > From: Müller, Stephan
> > > > Sent: Wednesday, November 27, 2013 5:02 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > > > multivalued fields
> > > >
> > > > Hello,
> > > >
> > > > this is a repost. This message was originally posted on the 'general'
> > > > list, but it was suggested that the 'user' list might be a better
> > > > place to ask.
> > > >
> > > > ---- Original Message ----
> > > > Hi,
> > > >
> > > > we are passing a multivalued field to the
> > > > LanguageIdentifierUpdateProcessor.
> > > > This multivalued field contains arbitrary types (Integer, String,
> > > > Date).
> > > >
> > > > Now, the
> > > > LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > > > doc, String[] fields), which btw does not use the parameter fields,
> > > > is unable to parse all values of a multivalued field. The call
> > > > "Object content = doc.getFieldValue(fieldName);" does not care what
> > > > type the field is and just delegates to SolrInputDocument, which in
> > > > turn calls getFirstValue.
> > > >
> > > > So, two issues:
> > > > First - if the first value of the multivalued field is not of type
> > > > String, the field is ignored completely.
> > > >
> > > > Second - the concat method does not concat all values of a
> > > > multivalued field.
> > > >
> > > > http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html
> > > > states: "The feature is designed to detect exactly one language per
> > > > field. In case of multivalued, it will concatenate all values before
> > > > detection."
> > > > But as far as I can see, the code is unable to do this at all for
> > > > multivalued fields.
> > > >
> > > > This behavior was found in 4.3 but the code is still the same for
> > > > current trunk (as of 2013-11-26)
> > > >
> > > > Is this a bug? Is this a special design decision? Did we miss a
> > > > certain configuration, that would allow the Language identification
> > > > to use all values of a multivalued field?
> > > >
> > > > We are about to write our own
> > > > LangDetectLanguageIdentifierUpdateProcessorFactory (why is
> > > > getInstance hardcoded to return LanguageIdentifierUpdateProcessor?)
> > > > and override LanguageIdentifierUpdateProcessor to handle all values
> > > > of a multivalued field, ignoring non-string values.
> > > >
> > > >
> > > >
> > > > Please see configuration below.
> > > >
> > > > I hope I was able to make myself clear. I'd like to hear your
> > > > thoughts on this, before I go off and file a bug report.
> > > >
> > > > Regards,
> > > > Stephan
> > > >
> > > >
> > > > A little background:
> > > > We are using a 3rd-party CMS framework which pulls in some magic
> > > > SOLR configuration (namely the textbody field).
> > > >
> > > > The textbody field is defined as follows:
> > > > <!--
> > > > The default text search field.
> > > > This field and the field name_tokenized are used as default search
> > > > fields for the /editor and /cmdismax search request handlers in
> > > > solrconfig.xml.
> > > >
> > > > For the Content Feeder the text of all indexed fields of the
> > > > CoreMedia document is stored in this field.
> > > > The CAE Feeder by default stores the text of all elements in this
> > > > field.
> > > > -->
> > > > <field name="textbody" type="text_general" stored="false"
> > > > multiValued="true"/>
> > > >
> > > > As you can see, it is also used as a search field; therefore we want
> > > > to have the actual datatypes on the values.
> > > > The field itself is generated by a processor, prior to calling the
> > > > language identification (see processor chain).
> > > >
> > > >
> > > >
> > > > The processor chain:
> > > >
> > > > <updateRequestProcessorChain>
> > > >   <!-- Improve error messages -->
> > > >   <processor class="3rdpartypackage.ErrorHandlingProcessorFactory"
> > > > />
> > > >
> > > >   <!-- Blob extraction -->
> > > >   <processor class="3rdpartypackage.BinaryDataProcessorFactory">
> > > >     <!-- some comments -->
> > > >   </processor>
> > > >
> > > >   <!-- Textbody handling -->
> > > >   <processor class="3rdpartypackage.TextBodyProcessorFactory" />
> > > >
> > > >   <!-- Copy content of field name to name_tokenized -->
> > > >   <processor class="solr.CloneFieldUpdateProcessorFactory">
> > > >     <str name="source">name</str>
> > > >     <str name="dest">name_tokenized</str>
> > > >   </processor>
> > > >
> > > >   <!--Language detection -->
> > > >   <processor
> > > >       class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
> > > >     <str name="langid.fl">textbody,name_tokenized</str>
> > > >     <str name="langid.langField">language</str>
> > > >     <str name="langid.fallback">en</str>
> > > >   </processor>
> > > >
> > > >   <!-- Index into language dependent fields if defined (e.g.
> > > > textbody_en instead of textbody) -->
> > > >   <processor
> > > >       class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
> > > >     <str name="languageField">language</str>
> > > >     <str name="textFields">textbody,name_tokenized</str>
> > > >   </processor>
> > > >
> > > >   <processor class="solr.RunUpdateProcessorFactory" />
> > > > </updateRequestProcessorChain>
> > >
> > >
>