Posted to solr-user@lucene.apache.org by "harish.agarwal" <ha...@gmail.com> on 2011/03/15 17:01:29 UTC

Sorting on multiValued fields via function query

Hello,
I believe the most recent builds of Solr have started explicitly throwing an
error around sorting on multiValued fields.  I'd actually been sorting on
multiValued fields for some time without any problems before this, not sure
how Solr was able to handle this in the past...

In any case, I'd like to be able to sort on multiValued fields via a
function query, but keep getting the following error:
can not use FieldCache on multivalued field

I've tried using the function 'sum', 'max', and 'min' with the same result.  
Is there any way to sort on the maximum value, for instance, of a
multiValued field?

Thanks,
-Harish

--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-on-multiValued-fields-via-function-query-tp2681833p2681833.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Sorting on multiValued fields via function query

Posted by "Smiley, David W." <ds...@mitre.org>.
Heh heh, you say "it worked correctly for me" yet you didn't actually have multi-valued data ;-)  Funny.

The only solution right now is to store the max and min into indexed single-valued fields at index time.  This is pretty straight-forward to do.  Even if/when Solr supports sorting on a multi-valued field, I doubt it would perform as well as what I suggest.
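
A minimal sketch of that index-time approach, written in Python purely for illustration. The field name `price` and the `_min`/`_max` suffixes are assumptions, not anything defined by Solr; the two derived fields would need to be declared as indexed, single-valued fields in the schema.

```python
def add_sort_fields(doc, field):
    """Return a copy of doc with <field>_min and <field>_max added.

    The suffixes are hypothetical names chosen for this sketch.
    Documents with no values for the field are returned unchanged.
    """
    values = doc.get(field) or []
    enriched = dict(doc)
    if values:
        enriched[field + "_min"] = min(values)
        enriched[field + "_max"] = max(values)
    return enriched


# Example document with a multi-valued "price" field (hypothetical data).
doc = {"id": "1", "price": [30, 10, 20]}
print(add_sort_fields(doc, "price"))
```

The enriched document is what gets sent to the indexer, so sorting only ever touches the single-valued fields.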

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 16, 2011, at 10:16 AM, harish.agarwal wrote:

> Hi David,
> 
> It did seem to work correctly for me - we had it running on our production
> indexes for some time and we never noticed any strange sorting behavior. 
> However, many of our multiValued fields are single valued for the majority
> of documents in our index so we may not have noticed the incorrect sorting
> behaviors.
> 
> Regardless, I understand the reasoning behind the restriction, I'm
> interested in getting around it by using a functionQuery to reduce
> multiValued fields to a single value.  It sounds like this isn't possible,
> is that correct?  Ideally I'd like to sort by the maximum value on
> descending sorts and the minimum value on ascending sorts.  Is there any
> movement towards implementing this sort of behavior?
> 
> Best,
> -Harish

Re: Sorting on multiValued fields via function query

Posted by boneill42 <bo...@alumni.brown.edu>.

Was there a solution here?  Is there a ticket related to the sort=max(FIELD)
solution?

-brian


Re: Sorting on multiValued fields via function query

Posted by Erick Erickson <er...@gmail.com>.
+1 for both Chris's and Yonik's comments.

On Thu, Mar 17, 2011 at 3:19 PM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Thu, Mar 17, 2011 at 2:12 PM, Chris Hostetter
> <ho...@fucit.org> wrote:
>> As the code stands now: we fail fast and let the person building the index
>> make a decision.
>
> Indexing two fields when one could work is unfortunate though.
> I think what we should support (eventually) is a max() function that will also
> work on a multi-valued field and select the maximum value (i.e. it will
> simply bypass the check for multi-valued fields).
>
> Then one can utilize sort-by-function to do
> sort=max(author) asc
>
> -Yonik
> http://lucidimagination.com
>

Re: Sorting on multiValued fields via function query

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Mar 17, 2011 at 2:12 PM, Chris Hostetter
<ho...@fucit.org> wrote:
> As the code stands now: we fail fast and let the person building the index
> make a decision.

Indexing two fields when one could work is unfortunate though.
I think what we should support (eventually) is a max() function that will also
work on a multi-valued field and select the maximum value (i.e. it will
simply bypass the check for multi-valued fields).

Then one can utilize sort-by-function to do
sort=max(author) asc
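
To make the intended semantics concrete, here is a rough Python sketch of what sorting by the maximum value of a multi-valued field would mean. It models the proposal only and is not an existing Solr function; placing documents without a value last is an assumption made for this sketch.

```python
def sort_by_max(docs, field, descending=False):
    """Order docs by the largest value each one holds in a multi-valued field."""
    with_values = [d for d in docs if d.get(field)]
    without_values = [d for d in docs if not d.get(field)]
    ordered = sorted(with_values, key=lambda d: max(d[field]), reverse=descending)
    # Assumption for this sketch: documents lacking the field sort last.
    return ordered + without_values


# Hypothetical documents with a multi-valued "author" field.
docs = [
    {"id": 1, "author": ["smith", "adams"]},
    {"id": 2, "author": ["brown"]},
    {"id": 3, "author": ["clark", "young"]},
]
print([d["id"] for d in sort_by_max(docs, "author")])
```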

-Yonik
http://lucidimagination.com

Re: Sorting on multiValued fields via function query

Posted by Chris Hostetter <ho...@fucit.org>.
: But if lucene now can sort a multi-valued field without crashing when there
: are 'too many' unique values, and with easily described and predictable
: semantics (use the minimal value in the multi-valued field as sort key) --
: then it probably makes more sense for Solr to let you do that if you really
: want to, give you enough rope to hang yourself.

(Clarification: it's the *maximal* value that gets used by lucene in 
that situation) 

I disagree.  

If we do what you describe we'd be relying on users to recognize when the 
sort logic is silently doing something "tricky" under the covers and make 
a conscious decision as to if that was what they want, and if not then 
change their indexing to account for it.  

That seems like a recipe for confusion and unexpected behavior.

with SOLR-2339 in place, we tell users explicitly and up front "what you 
are attempting to do can not work as specified" and we force them to 
decide in advance how they want to deal with it -- by either indexing the 
lowest value or the highest value (or both in distinct fields).

As the code stands now: we fail fast and let the person building the index 
make a decision.  If we silently sort on the maximal value, we leave a nasty 
headache for people who don't realize they are misusing a multiValued 
field and then wonder why some sorts don't do what they expect in some 
situations.
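
The fail-fast idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the actual SOLR-2339 code; the `schema` dictionary and the function name are hypothetical, and only the error text is taken from the message reported earlier in this thread.

```python
def check_sort_field(schema, field):
    """Reject a sort on a multi-valued field before any sorting happens."""
    # "schema" is a hypothetical dict of field name -> properties.
    if schema[field].get("multiValued", False):
        raise ValueError("can not use FieldCache on multivalued field: " + field)


schema = {
    "title": {"multiValued": False},
    "tags": {"multiValued": True},
}

check_sort_field(schema, "title")  # accepted silently
try:
    check_sort_field(schema, "tags")
except ValueError as err:
    print(err)
```

The point of the design is that the error surfaces on the first request, rather than as a puzzling ordering once some documents gain extra values.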

Bottom line: from day 1, we have always documented that sorting on 
multiValued fields (or fields that produced more than one term per 
document) didn't work.  If people didn't notice that documentation, they 
aren't likely to notice any documentation that says it will sort on the 
maximal value either -- SOLR-2339 may introduce a pain point for people 
upgrading, but it introduces it early and loudly, not quietly at some 
arbitrary moment in the future when they're beating their heads against a 
desk wondering why some sort isn't working the way they expect it to 
because they added some more values to a few documents.




-Hoss

Re: Sorting on multiValued fields via function query

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Aha, oh well, not quite as good/flexible as I hoped.

Still, if lucene is now behaving somewhat more predictably/rationally 
when sorting on multi-valued fields, then I think, in response to your 
other email on a similar thread, perhaps SOLR-2339  is now a mistake.

When lucene was returning completely unpredictable results -- and even 
sometimes crashing entirely -- when sorting on a multi-valued field --- 
then I think in that situation it made a lot of sense for Solr to 
prevent you from doing that, which is I think what SOLR-2339 does?  So I 
don't think it was necessarily a mistake in that context.

But if lucene now can sort a multi-valued field without crashing when 
there are 'too many' unique values, and with easily described and 
predictable semantics (use the minimal value in the multi-valued field 
as sort key) -- then it probably makes more sense for Solr to let you do 
that if you really want to, give you enough rope to hang yourself.

Jonathan

On 3/17/2011 10:49 AM, Yonik Seeley wrote:
> On Wed, Mar 16, 2011 at 6:08 PM, Jonathan Rochkind<ro...@jhu.edu>  wrote:
>> Also... if lucene is already capable of sorting on multi-valued field by
>> choosing the largest value.... largest vs. smallest is presumably just
>> arbitrary there, there is presumably no performance implication to choosing
>> the smallest instead of the largest. It just chooses the largest, according
>> to Yonik.
> It's a little more complicated than that.
> It's not so much an explicit feature in lucene, but just what
> naturally happens when building the field cache via uninverting an
> indexed field.
>
> It's pretty much this:
>
> for every term in the field:
>    for every document that matches that term:
>      value[document] = term
>
> And since terms are iterated from smallest to largest (and no, you
> can't reverse this)
> larger values end up overwriting smaller values.
> There's no simple patch to pick the smallest rather than the largest.
>
> In the past, lucene used to try and detect this multi-valued case by
> checking the number of values set in the whole array.  This was
> unreliable though and the check was discarded.
>
> -Yonik
> http://lucidimagination.com
>

Re: Sorting on multiValued fields via function query

Posted by Bill Bell <bi...@gmail.com>.
By the way, this could be done automatically by Solr or Lucene behind the scenes. 

Bill Bell
Sent from mobile


On Mar 17, 2011, at 9:02 AM, Bill Bell <bi...@gmail.com> wrote:

> Here is a work around. Stick the high value and low value into other fields. Use those fields for sorting.
> 
> Bill Bell
> Sent from mobile
> 
> 
> On Mar 17, 2011, at 8:49 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:
> 
>> On Wed, Mar 16, 2011 at 6:08 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
>>> Also... if lucene is already capable of sorting on multi-valued field by
>>> choosing the largest value.... largest vs. smallest is presumably just
>>> arbitrary there, there is presumably no performance implication to choosing
>>> the smallest instead of the largest. It just chooses the largest, according
>>> to Yonik.
>> 
>> It's a little more complicated than that.
>> It's not so much an explicit feature in lucene, but just what
>> naturally happens when building the field cache via uninverting an
>> indexed field.
>> 
>> It's pretty much this:
>> 
>> for every term in the field:
>>   for every document that matches that term:
>>     value[document] = term
>> 
>> And since terms are iterated from smallest to largest (and no, you
>> can't reverse this)
>> larger values end up overwriting smaller values.
>> There's no simple patch to pick the smallest rather than the largest.
>> 
>> In the past, lucene used to try and detect this multi-valued case by
>> checking the number of values set in the whole array.  This was
>> unreliable though and the check was discarded.
>> 
>> -Yonik
>> http://lucidimagination.com

Re: Sorting on multiValued fields via function query

Posted by Bill Bell <bi...@gmail.com>.
Here is a work around. Stick the high value and low value into other fields. Use those fields for sorting.

Bill Bell
Sent from mobile


On Mar 17, 2011, at 8:49 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Wed, Mar 16, 2011 at 6:08 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
>> Also... if lucene is already capable of sorting on multi-valued field by
>> choosing the largest value.... largest vs. smallest is presumably just
>> arbitrary there, there is presumably no performance implication to choosing
>> the smallest instead of the largest. It just chooses the largest, according
>> to Yonik.
> 
> It's a little more complicated than that.
> It's not so much an explicit feature in lucene, but just what
> naturally happens when building the field cache via uninverting an
> indexed field.
> 
> It's pretty much this:
> 
> for every term in the field:
>  for every document that matches that term:
>    value[document] = term
> 
> And since terms are iterated from smallest to largest (and no, you
> can't reverse this)
> larger values end up overwriting smaller values.
> There's no simple patch to pick the smallest rather than the largest.
> 
> In the past, lucene used to try and detect this multi-valued case by
> checking the number of values set in the whole array.  This was
> unreliable though and the check was discarded.
> 
> -Yonik
> http://lucidimagination.com

Re: Sorting on multiValued fields via function query

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Mar 16, 2011 at 6:08 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:
> Also... if lucene is already capable of sorting on multi-valued field by
> choosing the largest value.... largest vs. smallest is presumably just
> arbitrary there, there is presumably no performance implication to choosing
> the smallest instead of the largest. It just chooses the largest, according
> to Yonik.

It's a little more complicated than that.
It's not so much an explicit feature in lucene, but just what
naturally happens when building the field cache via uninverting an
indexed field.

It's pretty much this:

for every term in the field:
  for every document that matches that term:
    value[document] = term

And since terms are iterated from smallest to largest (and no, you
can't reverse this)
larger values end up overwriting smaller values.
There's no simple patch to pick the smallest rather than the largest.
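
For illustration, the loop above can be simulated in Python. This is a toy model, not Lucene code: the `postings` dictionary (term mapped to the ids of the documents containing it) and the function name are assumptions made for this sketch.

```python
def uninvert(postings, num_docs):
    """Build a one-slot-per-document cache from term -> doc id postings."""
    value = [None] * num_docs
    # Terms are visited from smallest to largest, as in the index.
    for term in sorted(postings):
        for doc in postings[term]:
            # A later (larger) term overwrites whatever an earlier one stored.
            value[doc] = term
    return value


# Docs 0 and 1 each have two values; doc 2 has one (hypothetical data).
postings = {"apple": [0, 1], "banana": [1], "cherry": [0, 2]}
print(uninvert(postings, 3))  # ['cherry', 'banana', 'cherry']
```

Each multi-valued document ends up holding only its largest term, which is why the resulting sort behaves as a sort on the maximum value.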

In the past, lucene used to try and detect this multi-valued case by
checking the number of values set in the whole array.  This was
unreliable though and the check was discarded.

-Yonik
http://lucidimagination.com

Re: Sorting on multiValued fields via function query

Posted by Bill Bell <bi...@gmail.com>.
I agree with this and it is even needed for function sorting for multivalued fields. See geohash patch for one way to deal with multivalued fields on distance. Not ideal but it works efficiently.

Bill Bell
Sent from mobile


On Mar 16, 2011, at 4:08 PM, Jonathan Rochkind <ro...@jhu.edu> wrote:

> Huh, so lucene is actually doing what has been commonly described as impossible in Solr?
> 
> But is Solr trunk, as the OP person seemed to report, still not aware of this and raising on a sort on multi-valued field, instead of just saying, okay, we'll just pass it to lucene anyway and go with lucene's approach to sorting on multi-valued field (that is, apparently, using the largest value)?
> 
> If so... that kind of sounds like a bug/misfeature, yes, no?
> 
> Also... if lucene is already capable of sorting on multi-valued field by choosing the largest value.... largest vs. smallest is presumably just arbitrary there, there is presumably no performance implication to choosing the smallest instead of the largest. It just chooses the largest, according to Yonik.
> 
> So... if someone patched lucene, so whether it chose the largest or smallest in that case was a parameter passed in -- probably not a large patch since lucene, says Yonik, already has been enhanced to choose largest always -- and then patched Solr to take a param and pass it to Lucene for this purpose, which presumably also wouldn't be a large patch if lucene supported it....   then we'd have the feature OP asked for.
> 
> Based on Yonik's description (assuming I understand correctly and he's correct), it doesn't sound like a lot of code. But it's still beyond my unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I have the interest for my own app needs at the moment. But if OP or someone else has both.... sounds like a plausible feature?
> 
> On 3/16/2011 6:00 PM, Yonik Seeley wrote:
>> On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
>> <ho...@fucit.org>  wrote:
>>> : However, many of our multiValued fields are single valued for the majority
>>> : of documents in our index so we may not have noticed the incorrect sorting
>>> : behaviors.
>>> 
>>> that would make sense ... if you use a multiValued field as if it were
>>> single valued, you would never encounter a problem.  if you had *some*
>>> multivalued fields your results would be sorted extremely arbitrarily for
>>> those docs that did have multiple values, unless you had more distinct
>>> values than you had documents -- at which point you would get a hard crash
>>> at query time.
>> AFAIK, not any more.  Since that behavior was very unreliable, it has
>> been removed and you can reliably sort by any multi-valued field in
>> lucene (with the sort order being defined by the largest value if
>> there are multiple).
>> 
>> -Yonik
>> http://lucidimagination.com
>> 

Re: Sorting on multiValued fields via function query

Posted by Jonathan Rochkind <ro...@jhu.edu>.
Huh, so lucene is actually doing what has been commonly described as 
impossible in Solr?

But is Solr trunk, as the OP person seemed to report, still not aware of 
this and raising on a sort on multi-valued field, instead of just 
saying, okay, we'll just pass it to lucene anyway and go with lucene's 
approach to sorting on multi-valued field (that is, apparently, using 
the largest value)?

If so... that kind of sounds like a bug/misfeature, yes, no?

Also... if lucene is already capable of sorting on multi-valued field by 
choosing the largest value.... largest vs. smallest is presumably just 
arbitrary there, there is presumably no performance implication to 
choosing the smallest instead of the largest. It just chooses the 
largest, according to Yonik.

So... if someone patched lucene, so whether it chose the largest or 
smallest in that case was a parameter passed in -- probably not a large 
patch since lucene, says Yonik, already has been enhanced to choose 
largest always -- and then patched Solr to take a param and pass it to 
Lucene for this purpose, which presumably also wouldn't be a large patch 
if lucene supported it....   then we'd have the feature OP asked for.

Based on Yonik's description (assuming I understand correctly and he's 
correct), it doesn't sound like a lot of code. But it's still beyond my 
unfamiliar-with-lucene-code-not-so-great-at-java abilities, nor do I 
have the interest for my own app needs at the moment. But if OP or 
someone else has both.... sounds like a plausible feature?

On 3/16/2011 6:00 PM, Yonik Seeley wrote:
> On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
> <ho...@fucit.org>  wrote:
>> : However, many of our multiValued fields are single valued for the majority
>> : of documents in our index so we may not have noticed the incorrect sorting
>> : behaviors.
>>
>> that would make sense ... if you use a multiValued field as if it were
>> single valued, you would never encounter a problem.  if you had *some*
>> multivalued fields your results would be sorted extremely arbitrarily for
>> those docs that did have multiple values, unless you had more distinct
>> values than you had documents -- at which point you would get a hard crash
>> at query time.
> AFAIK, not any more.  Since that behavior was very unreliable, it has
> been removed and you can reliably sort by any multi-valued field in
> lucene (with the sort order being defined by the largest value if
> there are multiple).
>
> -Yonik
> http://lucidimagination.com
>

Re: Sorting on multiValued fields via function query

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Mar 16, 2011 at 5:46 PM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : However, many of our multiValued fields are single valued for the majority
> : of documents in our index so we may not have noticed the incorrect sorting
> : behaviors.
>
> that would make sense ... if you use a multiValued field as if it were
> single valued, you would never encounter a problem.  if you had *some*
> multivalued fields your results would be sorted extremely arbitrarily for
> those docs that did have multiple values, unless you had more distinct
> values than you had documents -- at which point you would get a hard crash
> at query time.

AFAIK, not any more.  Since that behavior was very unreliable, it has
been removed and you can reliably sort by any multi-valued field in
lucene (with the sort order being defined by the largest value if
there are multiple).

-Yonik
http://lucidimagination.com

Re: Sorting on multiValued fields via function query

Posted by Chris Hostetter <ho...@fucit.org>.
: However, many of our multiValued fields are single valued for the majority
: of documents in our index so we may not have noticed the incorrect sorting
: behaviors.

that would make sense ... if you use a multiValued field as if it were 
single valued, you would never encounter a problem.  if you had *some* 
multivalued fields your results would be sorted extremely arbitrarily for 
those docs that did have multiple values, unless you had more distinct 
values than you had documents -- at which point you would get a hard crash 
at query time.

: Regardless, I understand the reasoning behind the restriction, I'm
: interested in getting around it by using a functionQuery to reduce
: multiValued fields to a single value.  It sounds like this isn't possible,

I don't think we have any functions that do that -- functions are composed 
of valuesources which may be composed of other value sources but 
ultimately the data comes from somewhere, and in every case i can think of 
(except for constant values) that data comes from the FieldCache -- the 
same FieldCache used for sorting.

I don't think there are any value sources that will let you specify a 
multiValued field, and then pick one of those values based on a 
rule/function ... even the "PolyFields" used for spatial search work by 
using multiple field names under the covers (N distinct field names for an 
N-dimensional space)

: is that correct?  Ideally I'd like to sort by the maximum value on
: descending sorts and the minimum value on ascending sorts.  Is there any
: movement towards implementing this sort of behavior?

this is a fairly classic usecase of just having multiple fields.  even if 
the logic was implemented to support this at query time, it could never be 
faster than sorting on a single-valued field that you populate with the 
min/max at indexing time -- the mantra of fast I/R is that if you can 
precompute it independently of the individual search criteria, you should 
(it's the whole foundation for why the inverted index exists)
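
As a rough sketch of how a client could then get "max on descending, min on ascending" with two precomputed fields, the sort parameter can be chosen per direction. The `_min`/`_max` suffixes and the function name are assumptions for this example; only the `field direction` form of the sort parameter is Solr's.

```python
def sort_param(field, direction):
    """Build a sort parameter value that targets a precomputed single-valued field."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    # Hypothetical naming: <field>_max holds the largest value, <field>_min the smallest.
    suffix = "_max" if direction == "desc" else "_min"
    return field + suffix + " " + direction


print(sort_param("price", "desc"))  # price_max desc
print(sort_param("price", "asc"))   # price_min asc
```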


-Hoss

Re: Sorting on multiValued fields via function query

Posted by "harish.agarwal" <ha...@gmail.com>.
Hi David,

It did seem to work correctly for me - we had it running on our production
indexes for some time and we never noticed any strange sorting behavior. 
However, many of our multiValued fields are single valued for the majority
of documents in our index so we may not have noticed the incorrect sorting
behaviors.

Regardless, I understand the reasoning behind the restriction, I'm
interested in getting around it by using a functionQuery to reduce
multiValued fields to a single value.  It sounds like this isn't possible,
is that correct?  Ideally I'd like to sort by the maximum value on
descending sorts and the minimum value on ascending sorts.  Is there any
movement towards implementing this sort of behavior?

Best,
-Harish


Re: Sorting on multiValued fields via function query

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Hi Harish. 
Did sorting on multiValued fields actually work correctly for you before?
I'd be surprised if so.  I could be wrong but I think you previously always
got the sorting effects of whatever was the last indexed value. It is indeed
the case that the FieldCache only supports up to one indexed value per
field. Recently Hoss added sanity checks that you are seeing the results of: 
https://issues.apache.org/jira/browse/SOLR-2339   You might want to comment
on that issue with proof (e.g. a simple test) that it worked before but not
now.

~ David

-----
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book