Posted to solr-dev@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/11/14 20:30:34 UTC

Functions as Filters

Is it at all meaningful to think about a function query acting as a Filter?  The basic idea being that if the score was above/below some value (presumably 0 by default), then that particular document would be on/off.  Right now, Solr can take in functions as fq params, but they don't really do anything.

-Grant


Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 16, 2009, at 3:26 PM, Yonik Seeley wrote:

> On Mon, Nov 16, 2009 at 3:18 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>> http://www.lucidimagination.com/blog/tag/frange/
>> 
>> I notice in the implementation that it assumes float.  What if I want double range?
> 
> That's the same generic problem as function query (float vs double vs
> int vs long)... we haven't solved it yet.

Right, but couldn't getRangeScorer() at least make the matchesValue calculation based on double instead of float?
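
To illustrate why the precision of the range comparison matters, here is a minimal sketch. It is not Solr's getRangeScorer(); the class and method names are hypothetical, and it only shows how a bound that cannot be represented as a float can let a value through that a double comparison would reject.

```java
class RangeMatchSketch {
    // Hypothetical helper: the range check is done in float precision.
    static boolean matchesAsFloat(double value, double lower, double upper) {
        float v = (float) value;
        return v >= (float) lower && v <= (float) upper;
    }

    // Same check, but kept in double precision throughout.
    static boolean matchesAsDouble(double value, double lower, double upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        double value = 16_777_216.0; // 2^24
        double lower = 16_777_217.0; // 2^24 + 1, not representable as a float
        double upper = 16_777_220.0;
        // The float version rounds the lower bound down to 2^24 and accepts the value.
        System.out.println("float match:  " + matchesAsFloat(value, lower, upper));
        // The double version keeps the bound exact and rejects the value.
        System.out.println("double match: " + matchesAsDouble(value, lower, upper));
    }
}
```

With these inputs the float comparison reports a match and the double comparison does not, which is the kind of difference a double-based matchesValue calculation would avoid.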

Re: Functions as Filters

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Nov 16, 2009 at 3:18 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> http://www.lucidimagination.com/blog/tag/frange/
>
> I notice in the implementation that it assumes float.  What if I want double range?

That's the same generic problem as function query (float vs double vs
int vs long)... we haven't solved it yet.

-Yonik
http://www.lucidimagination.com

Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 15, 2009, at 5:07 PM, Yonik Seeley wrote:

> On Sat, Nov 14, 2009 at 2:30 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> Is it at all meaningful to think about a function query acting as a Filter?  The basic idea being that if the score was above/below some value (presumably 0 by default), then that particular document would be on/off.
> 
> frange?
> 
> http://www.lucidimagination.com/blog/tag/frange/

I notice in the implementation that it assumes float.  What if I want double range?

-Grant

Re: Functions as Filters

Posted by Ryan McKinley <ry...@gmail.com>.
>
>
>> so I don't think any sort of
>> generic cache is needed for geo.
>
>
> Agreed, no generic cache for geo.   Was thinking about a generic  
> cache for function calculations.

I think something even more general would be good -- an easy way to share  
calculations between anything in the request cycle:  function query,  
search components, request handlers, response writers.

To avoid adding dependencies to stuff that does not need it, perhaps  
it makes sense to use the 'inform' model we have for SolrCoreAware  
type things.

Perhaps (bad name as I'm just throwing stuff out there)  
SharedContextAware / inform( SharedContext )

ryan


>  Potentially, we could have the need: sort, filter, facet and boost  
> by function.  And that calculation is likely the same over and over  
> within a given request.  Of course, if we add the pseudo-fields,  
> then that effectively acts as a cache for the request.
>
> -Grant


Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 16, 2009, at 9:20 AM, Yonik Seeley wrote:

> On Mon, Nov 16, 2009 at 8:23 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> One of the other things I think we are going to need is a cache for functions that are used this way.  For instance, in the geo case, it is likely that we would both filter and score by distance,
> 
> Filtering (bounding box) should be a separate, more efficient
> operation than calculating distance,

In some cases, potentially, but I think it is going to depend on the application.

> so I don't think any sort of
> generic cache is needed for geo.


Agreed, no generic cache for geo.   Was thinking about a generic cache for function calculations.  Potentially, we could have the need: sort, filter, facet and boost by function.  And that calculation is likely the same over and over within a given request.  Of course, if we add the pseudo-fields, then that effectively acts as a cache for the request.

-Grant

Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 17, 2009, at 10:35 AM, Yonik Seeley wrote:

> On Mon, Nov 16, 2009 at 9:20 AM, Yonik Seeley
> <yo...@lucidimagination.com> wrote:
>> On Mon, Nov 16, 2009 at 8:23 AM, Grant Ingersoll <gs...@apache.org> wrote:
>>> One of the other things I think we are going to need is a cache for functions that are used this way.  For instance, in the geo case, it is likely that we would both filter and score by distance,
>> 
>> Filtering (bounding box) should be a separate, more efficient
>> operation than calculating distance, so I don't think any sort of
>> generic cache is needed for geo.
> 
> Actually, you're right.
> I was thinking of filtering by a bounding box, but people will also
> want to filter by a radius (which should presumably use bounding boxes
> first to limit the number of points that we calculate the distance
> for).

Yep, I think frange actually works quite nicely for this case.

> 
> If someone then also sorts, the distance calculation won't be reused.
> I don't know a good way around that currently... a full cache would be
> pretty expensive memory-wise.

Right, we don't want a full cache that lives on like the other caches.  We could likely just shove the info onto the document or shove a Map onto the Request object itself.  Going back to my servlet days, I often just used ServletRequest attributes, or some other request-specific context, for this kind of thing.
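
A minimal sketch of the "Map on the request" idea, assuming one such object is created per request and thrown away afterwards. RequestScopedCache and its methods are hypothetical names, not an existing Solr API; the function is passed in as a plain lambda to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

class RequestScopedCache {
    // Lives only as long as one request, so it never grows into a full cache.
    private final Map<Integer, Double> valuesByDoc = new HashMap<>();
    private int computations = 0;

    // Returns the function value for a document, computing it at most once.
    double valueFor(int docId, IntToDoubleFunction function) {
        Double cached = valuesByDoc.get(docId);
        if (cached == null) {
            cached = function.applyAsDouble(docId);
            computations++;
            valuesByDoc.put(docId, cached);
        }
        return cached;
    }

    // Number of times the underlying function was actually evaluated.
    int computations() {
        return computations;
    }
}
```

A filter and a sort that both call valueFor() with the same document id would share one calculation, and the memory is released when the request ends.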

> 
> Actually, perhaps there wouldn't be too much wasted calculation after all?
> Seems like additional optimizations could limit how many points need
> distance calculated for filtering?
> 
> Consider a bounding box for a particular radius... one could also find
> a box that lies completely within that radius.  Only points inside the
> bigger box but outside the smaller box need to have a distance
> calculated.
> 
> Also, if one is sorting by distance anyway, a straight bounding box
> filter may be sufficient (i.e. users should have the option of the
> cheaper or more expensive filter).


It's not just sorting, though.  You could also want that function calculation for faceting and scoring, and maybe sorting as well.

In a real spatial application, I think it is fairly common to say, all in one request:
1. Filter by distance/bounding box
2. Within the box, boost the score based on distance from the center point and return the score
3. Return the distance from the center point as a field value (pseudo-fields)
4. Facet by function (i.e. distance) and put the results in buckets (all docs within walking distance, cycling distance, driving distance, everything else)
5. Sort by distance (in many cases, this one and #2 will be mutually exclusive, but not in all cases)

If you take a very dense geographical area, like Manhattan, you could still have hundreds of thousands, if not millions, of points all within a radius of 10 or 20 miles, so calculating that distance no more than once is going to be paramount to success.
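
The request described above could be sketched as a single pass in which the distance is computed once per point and then reused for filtering, faceting and sorting. This is only an illustration: the class name, the planar distance and the bucket limits (1, 5 and 20 units) are assumptions, not anything from Solr.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SinglePassDistance {
    final List<Double> sortedDistances = new ArrayList<>();
    final Map<String, Integer> facetCounts = new TreeMap<>();

    SinglePassDistance(double[][] points, double cx, double cy, double radius) {
        for (double[] p : points) {
            double dx = p[0] - cx;
            double dy = p[1] - cy;
            double d = Math.sqrt(dx * dx + dy * dy);       // computed once per point
            if (d > radius) continue;                      // filter by distance
            sortedDistances.add(d);                        // value to return and sort on
            facetCounts.merge(bucket(d), 1, Integer::sum); // facet by function
        }
        Collections.sort(sortedDistances);                 // sort by distance
    }

    // Hypothetical bucket limits; a real application would choose its own.
    static String bucket(double d) {
        if (d <= 1.0) return "walking";
        if (d <= 5.0) return "cycling";
        if (d <= 20.0) return "driving";
        return "everything else";
    }
}
```

Each stage here reads the same value of d, so adding a boost or another facet would not add another distance calculation.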

-Grant



Re: Functions as Filters

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Nov 16, 2009 at 9:20 AM, Yonik Seeley
<yo...@lucidimagination.com> wrote:
> On Mon, Nov 16, 2009 at 8:23 AM, Grant Ingersoll <gs...@apache.org> wrote:
>> One of the other things I think we are going to need is a cache for functions that are used this way.  For instance, in the geo case, it is likely that we would both filter and score by distance,
>
> Filtering (bounding box) should be a separate, more efficient
> operation than calculating distance, so I don't think any sort of
> generic cache is needed for geo.

Actually, you're right.
I was thinking of filtering by a bounding box, but people will also
want to filter by a radius (which should presumably use bounding boxes
first to limit the number of points that we calculate the distance
for).

If someone then also sorts, the distance calculation won't be reused.
I don't know a good way around that currently... a full cache would be
pretty expensive memory-wise.

Actually, perhaps there wouldn't be too much wasted calculation after all?
Seems like additional optimizations could limit how many points need
distance calculated for filtering?

Consider a bounding box for a particular radius... one could also find
a box that lies completely within that radius.  Only points inside the
bigger box but outside the smaller box need to have a distance
calculated.

Also, if one is sorting by distance anyway, a straight bounding box
filter may be sufficient (i.e. users should have the option of the
cheaper or more expensive filter).
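
A rough sketch of the two-box idea, assuming planar coordinates rather than real latitude/longitude math. The class and field names are hypothetical. The outer box is the square that encloses the circle; the inner box is the square inscribed in it, with half-side r / sqrt(2).

```java
class RadiusFilterSketch {
    // Counts how often the full distance test was needed.
    int distanceCalculations = 0;

    boolean withinRadius(double x, double y, double cx, double cy, double r) {
        double dx = Math.abs(x - cx);
        double dy = Math.abs(y - cy);
        // Outside the outer bounding box: cannot be within the radius.
        if (dx > r || dy > r) return false;
        // Inside the inscribed box: within the radius, no distance needed.
        double inner = r / Math.sqrt(2.0);
        if (dx <= inner && dy <= inner) return true;
        // Only the band between the two boxes needs the real test.
        distanceCalculations++;
        return dx * dx + dy * dy <= r * r;
    }
}
```

Points well inside or well outside the radius are decided by comparisons alone, so the distance calculation is limited to the points near the edge of the circle.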

-Yonik
http://www.lucidimagination.com

Re: Functions as Filters

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Mon, Nov 16, 2009 at 8:23 AM, Grant Ingersoll <gs...@apache.org> wrote:
> One of the other things I think we are going to need is a cache for functions that are used this way.  For instance, in the geo case, it is likely that we would both filter and score by distance,

Filtering (bounding box) should be a separate, more efficient
operation than calculating distance, so I don't think any sort of
generic cache is needed for geo.

-Yonik
http://www.lucidimagination.com

Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 15, 2009, at 8:32 PM, Grant Ingersoll wrote:

> 
> On Nov 15, 2009, at 5:07 PM, Yonik Seeley wrote:
> 
>> On Sat, Nov 14, 2009 at 2:30 PM, Grant Ingersoll <gs...@apache.org> wrote:
>>> Is it at all meaningful to think about a function query acting as a Filter?  The basic idea being that if the score was above/below some value (presumably 0 by default), then that particular document would be on/off.
>> 
>> frange?
>> 
>> http://www.lucidimagination.com/blog/tag/frange/
> 
> This might have legs...  In combination with the new Distance functions I committed yesterday:
> 
> http://localhost:8983/solr/select/?q=*:*&fq={!frange%20l=0%20u=5}dist%282,%2032,%20-80,%20lat,%20lon%29
> 
> As I see it, I can now filter, boost and sort (using the ^0 workaround) by distance using Function queries.  I've only tested on a small index so far (68K geo-locations), but will try to use something bigger once I get it indexed.  

One of the other things I think we are going to need is a cache for functions that are used this way.  For instance, in the geo case, it is likely that we would both filter and score by distance, so it would be nice to have them cached.  Or, at least cached for the life of the request.  Not sure if we need a longer term cache or not.  It would also be nice to be able to say "don't cache this" on a per request basis.

-Grant


Re: Functions as Filters

Posted by Grant Ingersoll <gs...@apache.org>.
On Nov 15, 2009, at 5:07 PM, Yonik Seeley wrote:

> On Sat, Nov 14, 2009 at 2:30 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> Is it at all meaningful to think about a function query acting as a Filter?  The basic idea being that if the score was above/below some value (presumably 0 by default), then that particular document would be on/off.
> 
> frange?
> 
> http://www.lucidimagination.com/blog/tag/frange/

This might have legs...  In combination with the new Distance functions I committed yesterday:

http://localhost:8983/solr/select/?q=*:*&fq={!frange%20l=0%20u=5}dist%282,%2032,%20-80,%20lat,%20lon%29

As I see it, I can now filter, boost and sort (using the ^0 workaround) by distance using Function queries.  I've only tested on a small index so far (68K geo-locations), but will try to use something bigger once I get it indexed.  
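
For anyone building such a request in code, here is a small sketch that assembles the same kind of frange filter and URL-encodes it. The class and method names are hypothetical; the host, path, the l and u bounds, and the dist(...) arguments are copied from the URL above. Note that URLEncoder writes spaces as "+" rather than "%20", which is equivalent in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class FrangeUrlSketch {
    // Builds the raw (unencoded) filter, e.g. {!frange l=0.0 u=5.0}dist(...)
    static String frangeFilter(double lower, double upper, String function) {
        return "{!frange l=" + lower + " u=" + upper + "}" + function;
    }

    // Appends q=*:* and the encoded fq parameter to the select handler URL.
    static String selectUrl(String base, String fq) {
        try {
            return base + "?q=" + URLEncoder.encode("*:*", "UTF-8")
                    + "&fq=" + URLEncoder.encode(fq, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fq = frangeFilter(0, 5, "dist(2, 32, -80, lat, lon)");
        System.out.println(selectUrl("http://localhost:8983/solr/select/", fq));
    }
}
```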

-Grant

Re: Functions as Filters

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Sat, Nov 14, 2009 at 2:30 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Is it at all meaningful to think about a function query acting as a Filter?  The basic idea being that if the score was above/below some value (presumably 0 by default), then that particular document would be on/off.

frange?

http://www.lucidimagination.com/blog/tag/frange/

-Yonik
http://www.lucidimagination.com