Posted to solr-user@lucene.apache.org by Jeff Wartes <jw...@whitepages.com> on 2013/08/14 20:26:24 UTC

Distance sort on a multi-value field

I'm still pondering aggregate-type operations for scoring multi-valued
fields (original thread: http://goo.gl/zOX53f ), and it occurred to me
that distance-sort with SpatialRecursivePrefixTreeFieldType must be doing
something like that.

Somewhat surprisingly I don't see this in the documentation anywhere, but
I presume the example query (from
http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4):
"q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10}"

assigns the distance/score based on the *closest* lat/long if the sfield
is a multi-valued field.
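
For anyone who wants to try the query above, here is a minimal sketch of how the request could be assembled in Python. The host, core name (collection1) and the helper name are hypothetical, and pairing score=distance with "sort=score asc" is my assumption about how the distance sort is meant to be used; only the local-params string itself comes from the wiki example.

```python
from urllib.parse import urlencode


def build_distance_sort_url(base_url, sfield, lat, lon, radius_km):
    """Build a select URL whose score is the distance to (lat, lon)."""
    # Doubled braces produce literal { and } inside the f-string.
    query = (
        f"{{!geofilt score=distance sfield={sfield} "
        f"pt={lat},{lon} d={radius_km}}}"
    )
    params = {
        "q": query,
        "sort": "score asc",  # assumption: score holds the distance, so ascending = nearest first
        "fl": "*,score",      # return the computed score with each document
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host and core name; adjust for a real installation.
url = build_distance_sort_url(
    "http://localhost:8983/solr/collection1/select",
    "geo", 54.729696, -98.525391, 10,
)
print(url)
```

urlencode takes care of escaping the braces, spaces and comma, so the local-params string can be written exactly as it appears in the wiki.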

That's a reasonable default, but it's a bit arbitrary. Can I sort based on
the *furthest* lat/long in the document? Or the average distance?

Anyone know more about how this works and could give me some pointers?

Thanks.


Re: Distance sort on a multi-value field

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Awesome!

Be sure to "watch" the JIRA issue as it develops.  The patch will improve
(I've already improved it but not posted it) and one day a solution is bound
to get committed.

~ David


Jeff Wartes wrote
> This is actually pretty far afield from my original subject, but it turns
> out that I also had issues with NRT and multi-field geospatial
> performance in Solr 4, so I'll follow that up.
> 
> 
> I've been testing and working with David's SOLR-5170 patch ever since he
> posted it, and I pushed it into production with only some cosmetic changes
> a few hours ago.
> I have a relatively low update and query rate for this particular query
> type (something like 2 updates/sec, 10 queries/sec) but a short
> autoSoftCommit time (5 sec). Based on the data so far, this patch looks
> like it's brought my average response time down from 4 seconds to about
> 50 ms.
> 
> Very nice!
> 
> 
> 

-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Distance-sort-on-a-multi-value-field-tp4084666p4086226.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Distance sort on a multi-value field

Posted by Jeff Wartes <jw...@whitepages.com>.
This is actually pretty far afield from my original subject, but it turns
out that I also had issues with NRT and multi-field geospatial
performance in Solr 4, so I'll follow that up.


I've been testing and working with David's SOLR-5170 patch ever since he
posted it, and I pushed it into production with only some cosmetic changes
a few hours ago.
I have a relatively low update and query rate for this particular query
type (something like 2 updates/sec, 10 queries/sec) but a short
autoSoftCommit time (5 sec). Based on the data so far, this patch looks
like it's brought my average response time down from 4 seconds to about
50 ms.

Very nice!




Re: Distance sort on a multi-value field

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
The distance sorting code in SOLR-2155 is roughly equivalent to the code that
RPT uses (RPT has its lineage in SOLR-2155 after all).  I just reviewed it
to double-check.  It's possible the behavior is slightly better in SOLR-2155
because the cache (a Solr cache) contains normal hard-references whereas RPT
has one based on weak references, which will linger longer.  But I think the
likelihood of OOM is the same.
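
To illustrate the hard-reference versus weak-reference distinction, here is a small Python analogy. It is not the Solr or Lucene code: the Reader class and both caches are hypothetical stand-ins. It only shows who owns the cached entry; in Java a weakly referenced entry is cleared when the garbage collector gets to it, which is the lingering mentioned above, whereas CPython's reference counting clears it right away.

```python
import gc
import weakref


class Reader:
    """Hypothetical stand-in for an index (segment) reader."""

    def __init__(self, name):
        self.name = name


# Hard references: the entry stays until something evicts it explicitly.
hard_cache = {}
# Weak keys: the entry can only live as long as the reader object does.
weak_cache = weakref.WeakKeyDictionary()

reader = Reader("segment_1")
hard_cache[reader.name] = [(54.73, -98.52)]  # keyed by name, keeps no reference to the reader
weak_cache[reader] = [(54.73, -98.52)]       # keyed by the reader object itself

del reader    # the reader is closed and nothing else references it
gc.collect()  # not needed on CPython, included for other interpreters

print(len(hard_cache), len(weak_cache))
```

Either way the cached shapes occupy memory while they are reachable, which is consistent with the point that the OOM risk is about the same.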

Anyway, the current best option is
https://issues.apache.org/jira/browse/SOLR-5170, which I posted a few days
ago.

~ David


Billnbell wrote
> We have been using 2155 for over 6 months in production with over 2M hits
> every 10 minutes. No OOM yet.
> 
> 2155 seems great, and would this issue be any worse than 2155?
> 
> 
> 


Re: Distance sort on a multi-value field

Posted by William Bell <bi...@gmail.com>.
We have been using 2155 for over 6 months in production with over 2M hits
every 10 minutes. No OOM yet.

2155 seems great, and would this issue be any worse than 2155?




-- 
Bill Bell
billnbell@gmail.com
cell 720-256-8076

Re: Distance sort on a multi-value field

Posted by Jeff Wartes <jw...@whitepages.com>.
Hm, "Give me all the stores that only have branches in this area" might be
a plausible use case for farthest distance.
That's essentially a "contains" question though, so maybe that's already
supported? I guess it depends on how contains/intersects/etc handle
multi-values. I feel like multi-value interaction really deserves its own
section in the documentation.
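
The equivalence between the farthest-distance case and a "contains"-style question can be sketched in a few lines of Python. The store names and distances are made up, and the two helper functions are hypothetical; this says nothing about how Solr's contains/intersects predicates treat multi-valued fields.

```python
def all_within(distances_km, radius_km):
    """True if every branch of the store lies inside the radius."""
    return all(d <= radius_km for d in distances_km)


def farthest_within(distances_km, radius_km):
    """Same question asked via the farthest branch (needs a non-empty list)."""
    return max(distances_km) <= radius_km


# Made-up distances (km) from the query point to each branch of a store.
stores = {
    "only_local": [1.2, 4.8, 9.9],
    "has_remote_branch": [0.5, 42.0],
}
radius_km = 10

matches = [
    name for name, distances in stores.items()
    if farthest_within(distances, radius_km)
]
print(matches)
```

Both helpers give the same answer for any non-empty list, so filtering on the farthest distance is just another way of asking whether all of a document's points fall inside the circle.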


I'm aware of the memory issue, but it seems like if you want to sort
multi-valued points, it's either this or try to pull in the 2155 patch. In
general I'd rather go with the thing that's being maintained.


Thanks for the code pointer. You're right, that doesn't look like
something I can easily use for more general aggregate scoring control. Ah
well.




Re: Distance sort on a multi-value field

Posted by "Smiley, David W." <ds...@mitre.org>.

On 8/14/13 2:26 PM, "Jeff Wartes" <jw...@whitepages.com> wrote:

>
>I'm still pondering aggregate-type operations for scoring multi-valued
>fields (original thread: http://goo.gl/zOX53f ), and it occurred to me
>that distance-sort with SpatialRecursivePrefixTreeFieldType must be doing
>something like that.

It isn't.

>
>Somewhat surprisingly I don't see this in the documentation anywhere, but
>I presume the example query (from
>http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4):
>"q={!geofilt score=distance sfield=geo pt=54.729696,-98.525391 d=10}"
>
>assigns the distance/score based on the *closest* lat/long if the sfield
>is a multi-valued field.

Yes it does.

>
>That's a reasonable default, but it's a bit arbitrary. Can I sort based on
>the *furthest* lat/long in the document? Or the average distance?
>
>Anyone know more about how this works and could give me some pointers?

I briefly considered supporting the farthest distance but dismissed it as
I saw no real use-case.  I didn't think of the average distance; that's
plausible.  Anyway, your best bet is to dig into the code.  The
relevant part is ShapeFieldCacheDistanceValueSource.
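
As a rough illustration of the three aggregates being discussed (closest, farthest, average), here is a standalone Python sketch. It is not what ShapeFieldCacheDistanceValueSource does internally; the haversine helper, the reducer parameter and the sample stores are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def mean(values):
    return sum(values) / len(values)


def doc_distance(points, origin, reducer=min):
    """Collapse a document's per-point distances into one sort value."""
    distances = [haversine_km(origin[0], origin[1], lat, lon) for lat, lon in points]
    return reducer(distances)


# Hypothetical multi-valued documents: each store has one or more points.
docs = {
    "store_a": [(54.73, -98.52), (55.50, -97.00)],  # one very near, one far away
    "store_b": [(54.80, -98.60)],                   # a single point a few km out
}
origin = (54.729696, -98.525391)

rankings = {}
for label, reducer in (("closest", min), ("farthest", max), ("average", mean)):
    rankings[label] = sorted(docs, key=lambda d: doc_distance(docs[d], origin, reducer))
    print(label, rankings[label])
```

With these sample points the closest-point ordering puts store_a first, while the farthest and average orderings put store_b first, which is why the choice of aggregate matters for multi-valued fields.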

FYI something to keep in mind:
https://issues.apache.org/jira/browse/LUCENE-4698

~ David