Posted to solr-user@lucene.apache.org by Aaron Daubman <da...@gmail.com> on 2012/06/05 04:17:03 UTC

Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Greetings,

I have "dirty" source data where some documents being indexed, although
unlikely, may contain multivalued fields that are also required for
sorting. In previous versions of Solr, sorting on this field worked fine
(possibly because few or no multivalued fields were ever encountered?).
However, as of 3.6.0, thanks to
https://issues.apache.org/jira/browse/SOLR-2339, attempting to sort on this
field now throws an error:

[2012-06-04 17:20:01,691] ERROR org.apache.solr.common.SolrException
org.apache.solr.common.SolrException: can not sort on multivalued field:
f_normalizedValue

The relevant bits of the schema.xml are:
<fieldType name="sfloat" class="solr.TrieFloatField" precisionStep="0"
positionIncrementGap="0" sortMissingLast="true"/>
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
required="false" multiValued="true"/>

Assuming that the source documents being indexed cannot be changed (which,
at least for now, they cannot), what would be the next best way to allow
for multiple f_normalizedValue fields appearing in indexed documents while
still being able to sort by f_normalizedValue?

Thank you,
     Aaron

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Chris Hostetter <ho...@fucit.org>.

: > In this day and age, a custom update handler is almost never the right
: > answer to a problem -- nor is a custom request handler that does updates
: > (those two things are actually different) ... my advice is always to
: > start by trying to implement what you need as an UpdateRequestProcessor,
: > and if that doesn't work out then refactor your code to be a Request
: > Handler instead.
: >
:
: e.g. benefits of UpdateRequestProcessor over custom update handler?

Purely from a code reuse standpoint.  Request Handler is really the
coarsest, broadest level of plugin you can implement.  You can write one
that does almost anything, but that requires you to do everything
yourself.

Writing an UpdateRequestProcessor instead of a Request Handler lets you
re-use your customizations with any Request Handler, and it lets you mix
and match the ordering w/ other Update Processors (instead of it being
in your handler where you have to do all your special stuff before you
call out to the processor chain), and makes it usable regardless of whether
your documents are coming from the XmlUpdateRequestHandler or DIH, or
whatever.


-Hoss
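
As a rough illustration of the reuse point above, a processor chain defined
once in solrconfig.xml can be shared by several handlers. This is only a
sketch for a 3.6-style config: the chain name "single-valued-sort" and the
class com.example.FirstValueOnlyProcessorFactory are hypothetical
placeholders for whatever custom processor gets written, data-config.xml is
an assumed DIH config file name, and update.chain is assumed to be the
parameter name in use (older releases used a different name).

```xml
<!-- Hypothetical chain; com.example.FirstValueOnlyProcessorFactory is a
     placeholder for a custom UpdateRequestProcessorFactory, not a Solr class. -->
<updateRequestProcessorChain name="single-valued-sort">
  <processor class="com.example.FirstValueOnlyProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory goes last: it performs the actual index update. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- The same chain reused by two different handlers via update.chain. -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">single-valued-sort</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="update.chain">single-valued-sort</str>
  </lst>
</requestHandler>
```

Because the field cleanup lives in the chain rather than in either handler,
reordering it relative to other processors is a one-line change.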

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Aaron Daubman <da...@gmail.com>.

While I look into doing some refactoring, as well as creating some new
UpdateRequestProcessors (and/or backporting), would you please point me to
some reading material on why you say the following:

> In this day and age, a custom update handler is almost never the right
> answer to a problem -- nor is a custom request handler that does updates
> (those two things are actually different) ... my advice is always to
> start by trying to implement what you need as an UpdateRequestProcessor,
> and if that doesn't work out then refactor your code to be a Request
> Handler instead.
>

e.g. benefits of UpdateRequestProcessor over custom update handler?

Thanks again for the great pointers,
      Aaron

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Chris Hostetter <ho...@fucit.org>.

: The new FieldValueSubsetUpdateProcessorFactory classes look phenomenal. I
: haven't looked yet, but what are the chances these will be back-ported to
: 3.6 (or how hard would it be to backport them?)... I'll have to check out
: the source in more detail.

3.x is bug fix only as we now focus on 4.0 ... but these particular
classes are fairly straightforward and isolated, and should be relatively
easy for someone with Java knowledge to backport to 3.6.

: If stuck on 3.6, what would be the best way to deal with this situation?
: It's currently looking like it will have to be a custom update handler, but

In this day and age, a custom update handler is almost never the right
answer to a problem -- nor is a custom request handler that does updates
(those two things are actually different) ... my advice is always to
start by trying to implement what you need as an UpdateRequestProcessor,
and if that doesn't work out then refactor your code to be a Request
Handler instead.

-Hoss

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Aaron Daubman <da...@gmail.com>.

Hoss,

The new FieldValueSubsetUpdateProcessorFactory classes look phenomenal. I
haven't looked yet, but what are the chances these will be back-ported to
3.6 (or how hard would it be to backport them?)... I'll have to check out
the source in more detail.

If stuck on 3.6, what would be the best way to deal with this situation?
It's currently looking like it will have to be a custom update handler, but
I'd hate to have to go down this route if there are more future-proof
options.

Thanks again,
     Aaron

On Tue, Jun 5, 2012 at 6:53 PM, Chris Hostetter <ho...@fucit.org> wrote:

>
> : The real issue here is that the docs are created externally, and the
> : producer won't (yet) guarantee that fields that should appear once will
> : actually appear once. Because of this, I don't want to declare the field as
> : multiValued="false" as I don't want to cause indexing errors. It would be
> : great for me (and apparently many others after searching) if there were an
> : option as simple as forceSingleValued="true" - where some deterministic
> : behavior such as "use first field encountered, ignore all others", would
> : occur.
>
> This will be trivial in Solr 4.0, using one of the new
> "FieldValueSubsetUpdateProcessorFactory" classes that are now available --
> just pick your rule...
>
>
> https://builds.apache.org/view/G-L/view/Lucene/job/Solr-trunk/javadoc/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html
> Direct Known Subclasses:
>    FirstFieldValueUpdateProcessorFactory,
>    LastFieldValueUpdateProcessorFactory,
>    MaxFieldValueUpdateProcessorFactory,
>    MinFieldValueUpdateProcessorFactory
>
> -Hoss
>

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Chris Hostetter <ho...@fucit.org>.

: The real issue here is that the docs are created externally, and the
: producer won't (yet) guarantee that fields that should appear once will
: actually appear once. Because of this, I don't want to declare the field as
: multiValued="false" as I don't want to cause indexing errors. It would be
: great for me (and apparently many others after searching) if there were an
: option as simple as forceSingleValued="true" - where some deterministic
: behavior such as "use first field encountered, ignore all others", would
: occur.

This will be trivial in Solr 4.0, using one of the new 
"FieldValueSubsetUpdateProcessorFactory" classes that are now available -- 
just pick your rule... 

https://builds.apache.org/view/G-L/view/Lucene/job/Solr-trunk/javadoc/org/apache/solr/update/processor/FieldValueSubsetUpdateProcessorFactory.html
Direct Known Subclasses:
    FirstFieldValueUpdateProcessorFactory, 
    LastFieldValueUpdateProcessorFactory, 
    MaxFieldValueUpdateProcessorFactory, 
    MinFieldValueUpdateProcessorFactory 

-Hoss
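
For reference, a minimal solrconfig.xml sketch of the Solr 4.0 approach
described above, assuming the "use first value, ignore the rest" rule. The
chain name "first-value-only" is made up for illustration, and the fieldRegex
selector is assumed to match the f_* dynamic fields from the original schema;
check the javadoc linked above for the selectors supported in your build.

```xml
<updateRequestProcessorChain name="first-value-only">
  <!-- Keep only the first value of any field whose name starts with f_ -->
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldRegex">f_.*</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Swapping in the Last, Max, or Min variant changes the rule without touching
the rest of the chain. The field itself would still need to be declared
multiValued="false" in schema.xml for sorting on it to be allowed.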

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Aaron Daubman <da...@gmail.com>.

Thanks for the responses,

> By saying "dirty data" you imply that only one of the values is "good" or
> "clean" and that the others can be safely discarded/ignored, as opposed to
> true multi-valued data where each value is there for good reason and needs
> to be preserved. In any case, how do you know/decide which value should be
> used for sorting - and did you just get lucky that Solr happened to use the
> right one?
>

I haven't gone back and checked the old version's docs where this was
"working", however, I suspect that either the field never ended up
appearing in docs more than once, or if it did, it had the same value
repeated...

The real issue here is that the docs are created externally, and the
producer won't (yet) guarantee that fields that should appear once will
actually appear once. Because of this, I don't want to declare the field as
multiValued="false" as I don't want to cause indexing errors. It would be
great for me (and apparently many others after searching) if there were an
option as simple as forceSingleValued="true" - where some deterministic
behavior such as "use first field encountered, ignore all others", would
occur.


> The preferred technique would be to preprocess and "clean" the data before
> it is handed to Solr or SolrJ, even if the source must remain "dirty".
> Barring that, a preprocessor or a custom update processor, certainly.
>

I could write preprocessors (this is really what will eventually happen
when the producer cleans their data), custom processors, etc. However,
for something this simple it would be great not to be producing more code
that would have to be maintained.



> Please clarify exactly how the data is being fed into Solr.
>

I am using "generic" code to read from a key/value store and compose
documents. This is another reason fixing the data at this point would not
be desirable: the currently generic code would need to be made specific to
look for these particular fields and then coerce them to single values...

Thanks again,
      Aaron

Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Jack Krupansky <ja...@basetechnology.com>.

By saying "dirty data" you imply that only one of the values is "good" or 
"clean" and that the others can be safely discarded/ignored, as opposed to 
true multi-valued data where each value is there for good reason and needs 
to be preserved. In any case, how do you know/decide which value should be 
used for sorting - and did you just get lucky that Solr happened to use the 
right one?

The preferred technique would be to preprocess and "clean" the data before
it is handed to Solr or SolrJ, even if the source must remain "dirty".
Barring that, a preprocessor or a custom update processor, certainly.

Please clarify exactly how the data is being fed into Solr.

And if you really do need to preserve the multiple values, simply store them 
in a separate field that is not sorted. An update processor can do this as 
well.

-- Jack Krupansky
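
A schema.xml sketch of that split, under the assumption that an update
processor (or a preprocessing step) reduces the sort field to one value and
writes the full set of values elsewhere. The all_f_* field name is
hypothetical; the sfloat type is the one from the original post.

```xml
<!-- Sort field: single-valued, so sorting on it is allowed. -->
<dynamicField name="f_*" type="sfloat" indexed="true" stored="true"
              required="false" multiValued="false"/>

<!-- Hypothetical companion field that keeps every original value; never sorted on. -->
<dynamicField name="all_f_*" type="sfloat" indexed="true" stored="true"
              required="false" multiValued="true"/>
```

With this layout, a document that still arrives with two f_normalizedValue
values would be rejected at index time unless something upstream collapses
them first.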

-----Original Message----- 
From: Erick Erickson
Sent: Tuesday, June 05, 2012 6:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Older versions of Solr didn't really sort correctly on multivalued fields, they
just didn't complain <G>.....

Hmmm. Off the top of my head, you can:
1> You don't say what the documents to be indexed are. Are they Solr-style
     documents on disk or do you process them with, say, a SolrJ program?
     If the latter, you can simply inspect them as you construct them and decide
     which of the multi-valued field values you want to use to sort and copy that
     single value into a new field and sort on that.
2> You could write a custom UpdateRequestProcessorFactory/UpdateRequestProcessor
     pair and do the same thing in the processAdd method.

Best
Erick


Re: Correct way to deal with source data that may include a multivalued field that needs to be used for sorting?

Posted by Erick Erickson <er...@gmail.com>.

Older versions of Solr didn't really sort correctly on multivalued fields, they
just didn't complain <G>.....

Hmmm. Off the top of my head, you can:
1> You don't say what the documents to be indexed are. Are they Solr-style
     documents on disk or do you process them with, say, a SolrJ program?
     If the latter, you can simply inspect them as you construct them and decide
     which of the multi-valued field values you want to use to sort and copy that
     single value into a new field and sort on that.
2> You could write a custom UpdateRequestProcessorFactory/UpdateRequestProcessor
     pair and do the same thing in the processAdd method.

Best
Erick
