Posted to solr-user@lucene.apache.org by Thomas Heigl <th...@umschalt.com> on 2012/08/03 10:56:27 UTC

Tuning caching of geofilt queries

Hey all,

Our production system is heavily optimized for caching and nearly all parts
of queries are satisfied by filter caches. The only filter that varies a
lot from user to user is the location and distance. Currently we use the
default location field type and index lat/long coordinates as we get them
from Geonames and GMaps with varying decimal precision.

My question is: Does it make sense to round these coordinates (a) while
indexing and/or (b) while querying to optimize cache hits? Our maximum
required resolution for geo queries is 1km and we can tolerate minor errors
so I could round to two decimal places for most of our queries.

E.g. Instead of querying like this

fq=_query_:"{!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}"&sfield=user.location_p&pt=48.1981,16.394


we would round to

fq=_query_:"{!geofilt sfield=user.location_p pt=48.19,16.39 d=50.0}"&sfield=user.location_p&pt=48.19,16.39


Any feedback would be greatly appreciated.

Cheers,

Thomas

Re: Tuning caching of geofilt queries

Posted by Erick Erickson <er...@gmail.com>.
I don't think rounding will affect cache hits in either case _unless_
the input points for different queries can be very close to each other.

Think of the filter cache as being composed of a map where the key
is the (raw) filter query and the value is the set of documents in your
corpus that satisfy it.

So the only time rounding would help is if it's likely that two
users enter very similar points at query time, e.g.
89.1234 and 89.1236. If you're giving them a set of pre-defined
choices (city center, say), then the values should already be
identical to all the decimal places, so rounding doesn't do you much
good.
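
To make that concrete, here is a minimal Python sketch of the cache-as-a-map
idea. It is only a toy model, not Solr's actual filterCache implementation:
FilterCacheModel and geofilt_fq are hypothetical names, and the cached value
is a placeholder rather than a real document set.

```python
# Toy model of a filter cache: the key is the raw filter query string.
# FilterCacheModel and geofilt_fq are hypothetical helpers for illustration;
# they are not part of Solr.

def geofilt_fq(lat, lon, d_km, sfield="user.location_p", decimals=None):
    """Build a geofilt filter string, optionally rounding the point first."""
    if decimals is not None:
        lat, lon = round(lat, decimals), round(lon, decimals)
    return "{!geofilt sfield=%s pt=%s,%s d=%s}" % (sfield, lat, lon, d_km)


class FilterCacheModel:
    """Counts hits and misses for filter strings, like a map keyed on fq."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, fq):
        """Return True on a cache hit; on a miss, store a placeholder entry."""
        if fq in self.entries:
            self.hits += 1
            return True
        self.misses += 1
        self.entries[fq] = "doc set placeholder"
        return False


if __name__ == "__main__":
    raw, rounded = FilterCacheModel(), FilterCacheModel()
    for lat in (89.1234, 89.1236):
        raw.lookup(geofilt_fq(lat, 16.3943, 50.0))
        rounded.lookup(geofilt_fq(lat, 16.3943, 50.0, decimals=2))
    print("raw:", raw.hits, "hits,", raw.misses, "misses")
    print("rounded:", rounded.hits, "hits,", rounded.misses, "misses")
```

The two raw points produce two different keys and therefore two misses; once
rounded to two decimals they collapse into one key, so the second lookup is a
hit. If your users never send nearby-but-different points, the rounded and raw
caches behave the same.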

You say you can tolerate some slop, so using bounding box might
speed up your queries...

Best
Erick


Re: Tuning caching of geofilt queries

Posted by Lance Norskog <go...@gmail.com>.
In other computations I found exactly zero performance difference
between floats & doubles, even with long arrays of numbers, which you
would expect to be sensitive to locality effects.

On Fri, Aug 10, 2012 at 11:20 AM, David Smiley (@MITRE.org)
<DS...@mitre.org> wrote:
> Yeah it is... I rather like this write-up:
> https://sites.google.com/site/trescopter/Home/concepts/required-precision-for-gps-calculations#TOC-Precision-of-Float-and-Double
> -- which also arrives at 2.37m worst case.
>
> Aside from RAM savings, I wonder if there is any noticeable performance
> difference for LatLonType.



-- 
Lance Norskog
goksron@gmail.com

Re: Tuning caching of geofilt queries

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Yeah it is... I rather like this write-up:
https://sites.google.com/site/trescopter/Home/concepts/required-precision-for-gps-calculations#TOC-Precision-of-Float-and-Double
-- which also arrives at 2.37m worst case.

Aside from RAM savings, I wonder if there is any noticeable performance
difference for LatLonType.



-----
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Tuning-caching-of-geofilt-queries-tp3998975p4000534.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Tuning caching of geofilt queries

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Fri, Aug 10, 2012 at 1:47 PM, David Smiley (@MITRE.org)
<DS...@mitre.org> wrote:
> The information I've read varies on exactly what the accuracy of float
> vs double is, but at a kilometer there's no question a double is overkill.

Back of the envelope:

23 mantissa bits + 1 implied bit == 24 effective mantissa bits in a 32
bit float.

40,000 km circumference / (2^24) = .0024 km  (i.e. our resolution at
the equator is 2.4m at best - there will be some lost unused space at
the beginning and end of the +-180 number-line).
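
The same arithmetic as a small Python sketch, in case anyone wants to play
with the numbers. The 40,000 km figure and the uniform-spacing assumption are
the back-of-the-envelope simplifications from above; the float32 round-trip
via struct is an extra check added for illustration, and all function names
are made up.

```python
import struct

EARTH_CIRCUMFERENCE_KM = 40_000.0  # rough figure used in the estimate above
EFFECTIVE_MANTISSA_BITS = 24       # 23 stored bits + 1 implied bit


def uniform_resolution_m():
    """Equator resolution if 2^24 values were spread evenly around the globe."""
    return EARTH_CIRCUMFERENCE_KM / 2 ** EFFECTIVE_MANTISSA_BITS * 1000.0


def to_float32(value):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def roundtrip_error_m(lon_deg):
    """Error at the equator, in meters, from storing a longitude as float32."""
    meters_per_degree = EARTH_CIRCUMFERENCE_KM / 360.0 * 1000.0
    return abs(to_float32(lon_deg) - lon_deg) * meters_per_degree


if __name__ == "__main__":
    print("uniform estimate: %.2f m" % uniform_resolution_m())
    for lon in (16.3943, 179.9):
        print("lon %s: %.3f m round-trip error" % (lon, roundtrip_error_m(lon)))
```

Real floats are not evenly spaced, so the actual error is smaller near zero
and largest near +-180, but either way it is far below the 1km tolerance.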

Is that in line with what you've read?

-Yonik
http://lucidworks.com

Re: Tuning caching of geofilt queries

Posted by "David Smiley (@MITRE.org)" <DS...@mitre.org>.
Chris's response is quite good, and I have a couple things to add:

1. Since you can tolerate 1km slop, try defining the dynamic field
*_coordinate as tfloat instead of tdouble.  This will halve your memory
requirements, but I'm not sure if it will be any faster -- it's worth a shot
since you've already indicated that your requirements don't call for a
double.  The information I've read varies on exactly what the accuracy of
float vs double is, but at a kilometer there's no question a double is overkill.

2. Try my Solr 3.x spatial plugin called "SOLR-2155" on GitHub:
https://github.com/dsmiley/SOLR-2155   It is very fast at filtering (even
for circles) as indicated in this Stack Overflow thread:
http://stackoverflow.com/questions/11636376/solr-performance-on-ec2-for-geospatial-queries
in which it destroys LatLonType in a big data speed test :-D.   You should
be happy to know that this technology is on its way into Solr 4, albeit not
quite yet.

Cheers,
  ~ David Smiley




Re: Tuning caching of geofilt queries

Posted by Chris Hostetter <ho...@fucit.org>.
: My question is: Does it make sense to round these coordinates (a) while
: indexing and/or (b) while querying to optimize cache hits? Our maximum
: required resolution for geo queries is 1km and we can tolerate minor errors
: so I could round to two decimal places for most of our queries.

: fq=_query_:"{!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}"&sfield=user.location_p&pt=48.1981,16.394

1) I don't see any reason for the _query_ hack ... this
should be more efficient, and easier on the eyes...

 fq={!geofilt sfield=user.location_p pt=48.19815,16.3943 d=50.0}
&sfield=user.location_p
&pt=48.1981,16.394

2) As Erick mentioned, rounding will only do you good if you expect
lots of queries from different users that, when rounded, result in the same
point.

3) You might consider disabling the caching of your geofilt queries
completely using the cache=false param. For {!geofilt} you should also be
able to combine this with the "cost" localparam to take advantage of
post-filtering, so that the distance calculations are only computed for
documents that already match your query and other cached filters...

http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters
http://searchhub.org/dev/2012/02/10/advanced-filter-caching-in-solr/

4) Something you also might want to consider (depending on your data and
how much geo surface area you are dealing with) is along the lines of
Erick's bounding box suggestion: use two filters; a "coarse"
bounding box that you cache, and a precise geofilt using the cache & cost
params mentioned in #3.

That way you have a finite number of bounding box filters that will be
cached and help quickly prune the total result set down, and then only for
the results inside that bounding box will the distance calculations for
your {!geofilt} filter be applied.  (Just make sure your bounding boxes
overlap by at least as much as the max radius you search on, or you might
miss results when your search point is close to the edge of your grid.)
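
A rough Python sketch of how a client could build the two filters from #3
and #4. Everything here is an assumption for illustration: the 0.5 degree
grid, the 111.2 km-per-degree approximation, the helper names, and cost=100
(my understanding is that post-filtering needs cache=false plus a cost of 100
or more, and whether geofilt actually runs as a post-filter depends on your
field type and Solr version, so test it). It also ignores the poles and the
date line.

```python
import math
from urllib.parse import urlencode

KM_PER_DEGREE = 111.2  # rough approximation (assumption)


def snap(value, grid_deg):
    """Snap a coordinate to the nearest point on a fixed grid."""
    return round(round(value / grid_deg) * grid_deg, 6)


def build_filters(lat, lon, radius_km, sfield="user.location_p", grid_deg=0.5):
    """Return a cacheable coarse bbox filter plus an uncached precise geofilt."""
    center_lat, center_lon = snap(lat, grid_deg), snap(lon, grid_deg)
    # The true point is at most half a cell diagonal from the snapped center,
    # so pad the box by that much to make neighboring boxes overlap enough.
    pad_km = math.ceil(grid_deg / 2 * KM_PER_DEGREE * math.sqrt(2))
    coarse = "{!bbox sfield=%s pt=%s,%s d=%s}" % (
        sfield, center_lat, center_lon, radius_km + pad_km)
    precise = "{!geofilt cache=false cost=100 sfield=%s pt=%s,%s d=%s}" % (
        sfield, lat, lon, radius_km)
    return [("fq", coarse), ("fq", precise)]


if __name__ == "__main__":
    params = build_filters(48.19815, 16.3943, 50.0)
    for name, value in params:
        print(name, "=", value)
    print(urlencode(params))  # URL-encoded form to append to the request
```

Nearby users snap to the same grid point, so they share one cached bbox
filter, while the exact distance check stays uncached. A smaller grid_deg
gives tighter boxes but more distinct cache entries.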


-Hoss