Posted to solr-user@lucene.apache.org by Jay Hill <ja...@gmail.com> on 2012/01/27 23:25:44 UTC

Complex query, need filtering after query not before

I have a project where we need to search 1B docs and still return results in
under 700ms. The problem is that we are using geofiltering, and it happens
*before* the query: we have to geofilter the 1B docs to restrict our set of
docs first, and only then run the query on a name field. It seems that it
would be better and faster to run the main query first, and only then filter
that subset of docs by geo. Here is what a typical query looks like:

?shards=<list of 20 nodes>
&q={!boost b=sum(recip(geodist(geo_lat_long,38.2493581,-122.0399663),1,1,1))}(given_name:Barack OR given_name_exact:Barack^4.0) AND family_name:Obama
&fq={!geofilt pt=38.2493581,-122.0399663 sfield=geo_lat_long d=120}
&fq=(-source:somedatasource)
&rows=4
QTime=1040

I've looked at the "cache=false" param, and the "cost=" param, but that's
not going to help much because we still have to do the filtering. (We *will*
use "cache=false" to avoid the overhead of caching queries that will very
rarely be the same.)

Is there any way to indicate a filter query should happen *after* the other
results? The other fq on source restricts the docset somewhat, but
different variations don't eliminate a high number of docs, so we could use
the "cost" param to run the fq on source before the fq on geo, but it would
only help very minimally in some cases.


Thanks,
-Jay

Re: Complex query, need filtering after query not before

Posted by Chris Hostetter <ho...@fucit.org>.
: 700ms. The problem is, we are using geofiltering and that is happening *
: before* the queries, so we have to geofilter on the 1B docs to restrict our
: set of docs first, and then do the query on a name field. But it seems that

	...

: I've looked at the "cache=false" param, and the "cost=" param, but that's
: not going to help much because we still have to do the filtering. (We
: *will* use
: "cache=false" to avoid the overhead of caching queries that will very
: rarely be the same.)
: 
: Is there any way to indicate a filter query should happen *after* the other
: results? The other fq on source restricts the docset somewhat, but

That's what the "cost" param does: it tells Solr the order in which to
evaluate the filters, and Solr won't bother asking a filter with cost "50" to
evaluate a doc that a filter with cost "10" has already ruled out.
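
As a rough illustration of that ordering, here is a minimal Python sketch that
builds the request parameters with a cost on each filter. The field names and
values come from the query earlier in this thread; the cost values 10 and 50
are arbitrary, and both filters are marked cache=false on the assumption that
cost ordering only matters for non-cached filters.

```python
from urllib.parse import urlencode

# Values copied from the query in the original post; the cost numbers are
# illustrative assumptions, not tuned recommendations.
params = [
    ("q", "(given_name:Barack OR given_name_exact:Barack^4.0) AND family_name:Obama"),
    # Cheaper filter gets the lower cost, so it is consulted first.
    ("fq", "{!cache=false cost=10}(-source:somedatasource)"),
    # Geo filter gets the higher cost, so it only sees docs the first filter kept.
    ("fq", "{!geofilt cache=false cost=50 pt=38.2493581,-122.0399663 "
           "sfield=geo_lat_long d=120}"),
    ("rows", "4"),
]

# urlencode accepts a list of pairs, which keeps both fq parameters.
query_string = urlencode(params)
print(query_string)
```

The list-of-pairs form is used instead of a dict because fq is repeated.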

One thing you may be overlooking is this little bit of juicy goodness...

http://wiki.apache.org/solr/CommonQueryParameters#Caching_of_filters

>> As an additional feature for very high cost filters, if cache=false and 
>> cost>=100 and the query implements the PostFilter interface, a 
>> Collector will be requested from that query and used to filter 
>> documents only after they have matched the main query and all other 
>> filter queries.  

...at the moment only frange implements PostFilter, but maybe using that 
as a model you could patch geofilt to implement it?

hell: couldn't you rewrite your fq={!geofilt ...} to be an fq={!frange 
...}geodist(...) w/o any code changes?
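
A minimal sketch of that rewrite, in Python, assuming geodist() returns the
distance in the same units that geofilt's d uses, so d=120 becomes an upper
bound of u=120. The helper name geo_post_filter is hypothetical; cache=false
and cost=100 are the conditions from the wiki passage quoted above.

```python
def geo_post_filter(sfield, lat, lon, max_dist):
    """Build an frange filter over geodist(), flagged to run as a post filter.

    lat and lon are passed as strings so the coordinates are emitted exactly
    as given. Hypothetical helper, not part of Solr or any client library.
    """
    return (
        f"{{!frange l=0 u={max_dist} cache=false cost=100}}"
        f"geodist({sfield},{lat},{lon})"
    )

# Same field, point and distance as the geofilt in the original query.
fq = geo_post_filter("geo_lat_long", "38.2493581", "-122.0399663", 120)
print(fq)
```

The resulting string would replace the fq={!geofilt ...} parameter; whether it
is actually faster on a given index would need to be measured.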


-Hoss

Re: Complex query, need filtering after query not before

Posted by Mikhail Khludnev <mk...@griddynamics.com>.
Hello Jay,

You can trade some precision for performance: by reducing the precision of
the coordinates (snapping them onto a grid) you can increase the cache hit
ratio. Then try bbox for faster rough filtering
http://wiki.apache.org/solr/SpatialSearch#bbox_-_Bounding-box_filter
and apply the geodist() function in frange to reduce the amount of calculation:
&q={!frange l=0 u=5}geodist()
For example:
http://localhost:8983/solr/select?q=*:*&sfield=store&pt=45.15,-93.85&facet.query={!frange%20l=0%20u=5}geodist()&facet.query={!frange%20l=5.001%20u=3000}geodist()&wt=xml&facet=true
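
One way to read this suggestion, as a Python sketch under stated assumptions:
the grid step of 0.1 degrees and the helper name snap_to_grid are made up for
illustration, the snapped point is used for both filters (which is where the
precision is lost), and sfield, pt and d are sent as top-level parameters so
that {!bbox} and geodist() can pick them up.

```python
from urllib.parse import urlencode

def snap_to_grid(value, step=0.1):
    """Round a coordinate to the nearest multiple of step.

    Hypothetical helper; the step size is an assumption, not a recommendation.
    """
    return round(round(value / step) * step, 6)

# Nearby query points collapse onto the same grid point, so the rough bbox
# filter is more likely to be reused across requests.
lat = snap_to_grid(38.2493581)
lon = snap_to_grid(-122.0399663)

params = [
    ("q", "(given_name:Barack OR given_name_exact:Barack^4.0) AND family_name:Obama"),
    ("sfield", "geo_lat_long"),
    ("pt", f"{lat},{lon}"),
    ("d", "120"),
    # Rough, cheap filter on the snapped point.
    ("fq", "{!bbox}"),
    # Exact distance check, run only on docs that passed everything else.
    ("fq", "{!frange l=0 u=120 cache=false cost=100}geodist()"),
]
print(urlencode(params))
```

A coarser step should give more reuse of the bbox filter at the price of a
larger shift in the search centre.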

Regards


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>