You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-dev@lucene.apache.org by Mike Klaas <mi...@gmail.com> on 2007/11/07 22:10:43 UTC

Intuition check

Not sure if this went through--I sent using the wrong email addr.

I've also posted a patch:
  [ https://issues.apache.org/jira/browse/SOLR-407? 
page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

----------------------------------------------------------------------

I'm thinking of implementing something like "fq.nocache": a query  
restriction that would not be cached in filterCache.

This is actually somewhat tricky to implement in the Guts of  
SolrIndexSearcher without adding a lot to the interface.  Then I  
realized that without caching, there really is no point in computing  
separate BitSets and intersecting them: it would be better to insert  
them directly into the BooleanQuery (or a wrapping BooleanQuery).   
This would also provide nicer skipTo behaviour (perhaps most docs  
needn't be scored before applying the filter).

Is there any reason to think that this wouldn't be the most efficient  
solution?  Obviously the clauses would be inserted with 0 boost.

cheers,
-Mike

Re: Intuition check

Posted by Mike Klaas <mi...@gmail.com>.
On 14-Nov-07, at 12:00 AM, Paul Elschot wrote:

> On Wednesday 14 November 2007 06:42:56 Mike Klaas wrote:
> ...
>> ...  I was
>> imagining that it was necessary to embed the filter clauses in the
>> "core" to produce an effective implementation.  By the time I
>> finished my response, I had read enough of the relevant Lucene scorer
>> code (in particular, ReqOptScorer) to realize that the benefits would
>> be had using an outer-layer ConjunctionQuery as well.
>
> I have a extension of Lucene's ConjunctionSumScorer that allows
> filter clauses, but it's waiting for
> http://issues.apache.org/jira/browse/LUCENE-584 .

LUCENE-584 is the ideal means of implementing this, though it seems  
that it will be in 2.4 at the earliest.  It is cool stuff though--I  
added a vote.

-Mike

Re: Intuition check

Posted by Paul Elschot <pa...@xs4all.nl>.
On Wednesday 14 November 2007 06:42:56 Mike Klaas wrote:
...
> ...  I was
> imagining that it was necessary to embed the filter clauses in the
> "core" to produce an effective implementation.  By the time I
> finished my response, I had read enough of the relevant Lucene scorer
> code (in particular, ReqOptScorer) to realize that the benefits would
> be had using an outer-layer ConjunctionQuery as well.

I have a extension of Lucene's ConjunctionSumScorer that allows
filter clauses, but it's waiting for
http://issues.apache.org/jira/browse/LUCENE-584 .

Regards,
Paul Elschot

Re: Intuition check

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 10, 2007 8:48 PM, Chris Hostetter <ho...@fucit.org> wrote:
> : I've since considered trying out a SortedIntSet since they would be
> : both smaller, and usable in skipTo.
>
> If you think it's worth doing, then it probably is.

It's complicated...
- SortedIntSet would be about 44% smaller on average
- random lookups would be significantly slower (but it's unclear how
many random lookups need to be done)
- intersection of 2 SortedIntSets of near equal size should be slightly faster
- intersection of 2 SortedIntSets of different sizes will be slower
- intersection of a SortedIntSet with a BitSet will be slightly faster
- intersection of a small uncached int list with a large SortedIntSet
will be slower (think facet.enum.cache.minDf)

Anyway, this is just something to keep in mind...  there are bigger
fish to fry right now.

-Yonik

Re: Intuition check

Posted by Chris Hostetter <ho...@fucit.org>.
: >   2) "here is an 'fq'" ... in which we get the DocSet and add it to the
: >      main query if it's small.
: 
: One issue is that HashDocSet would need to first be sorted, but that
: should hopefully be relatively quick.

Oh...  hmm, yeah i thought DocIterators already garunteed order.  but 
either way, the DocSets we are dealing with should be small, so hopefully 
the sorting is fast.

: I've since considered trying out a SortedIntSet since they would be
: both smaller, and usable in skipTo.

If you think it's worth doing, then it probably is. 

(over the years i've seen more then enough evidence that "yonik with his 
performance hat on" is just as much of an expert in his field as "yonik 
with his thread safty hat on")



-Hoss


Re: Intuition check

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 8, 2007 7:34 PM, Chris Hostetter <ho...@fucit.org> wrote:
>   2) "here is an 'fq'" ... in which we get the DocSet and add it to the
>      main query if it's small.

One issue is that HashDocSet would need to first be sorted, but that
should hopefully be relatively quick.
Background: HashDocSet was developed at a time when Lucene didn't
deliver docs in sorted order anyway... it would have required extra
time to sort, and lookups would be slower (binary search vs hash).
I've since considered trying out a SortedIntSet since they would be
both smaller, and usable in skipTo.

-Yonik

Re: Intuition check

Posted by Mike Klaas <mi...@gmail.com>.
On 8-Nov-07, at 4:34 PM, Chris Hostetter wrote:

>
> : First, how to determine whether the filter-embedding would be  
> effective?  We
> 	...
> : really available.  It can be estimated assuming the filter and  
> query are
> : independent, but this definitely isn't always true.  If the filter
>
> I was assuming we could use a simple hueristic...
>     if( configOption < docSet.size()/numDocs() )

Another case that comes to mind is if the matching query is a  
MatchAllDocsQuery, in which case the filter should probably be used  
directly.

> : Second, embedding the filter itself.  This is much more  
> nettlesome within
> : SolrIndexSearcher than within one of the request handlers.  One  
> problem is the
>
> really?  why should it be?

Sorry, that sentence was the product of thinking-while-responding,  
which is always a recipe for being wrong <g>.  I had a particular  
query structure in mind, one that had the matching clauses embedded  
in the inner "core" of the query with several layers of score  
modification queries wrapped on top of this (e.g. dismax's various  
boost queries; yonik's multiplicative boost queries).  I was  
imagining that it was necessary to embed the filter clauses in the  
"core" to produce an effective implementation.  By the time I  
finished my response, I had read enough of the relevant Lucene scorer  
code (in particular, ReqOptScorer) to realize that the benefits would  
be had using an outer-layer ConjunctionQuery as well.

> anything the request handler can do to much with the Query object
> SolrIndexSearcher can do as well .. and by the time
> getDocListNC/getDocListAndSetNC are called the "pure negative"  
> issues are
> alrady resolved.
>
> The only difference is that in those methods we already have a DocSet
> (instead of a Query) but it should be easy to wrap a DocSet in a  
> Query to
> add to the main query.
>
> : ISTM then that the main challenge is in determining when the filter
> : intersection should be embedded.  Also, the ability to control  
> filter caching
> : is still difficult with this implementation, but perhaps that's less
> : important.
>
> yeah ... it seems like there are two orthoginal use cases...
>   1) "here is an 'fq', i know it's not worth caching" ... in which we
>      don't put it in the filterCache.
>   2) "here is an 'fq'" ... in which we get the DocSet and add it to  
> the
>      main query if it's small.
>
> for any given input, 1 and 2 might both apply, or just 1, or just 2

True.  I'm tempted to implement a <!nocache> directive via embedding  
(without advertising the fact), and work on the fq optimization  
separately.

Thanks,
-Mike

Re: Intuition check

Posted by Chris Hostetter <ho...@fucit.org>.
: First, how to determine whether the filter-embedding would be effective?  We
	...
: really available.  It can be estimated assuming the filter and query are
: independent, but this definitely isn't always true.  If the filter

I was assuming we could use a simple hueristic...
    if( configOption < docSet.size()/numDocs() ) 

: Second, embedding the filter itself.  This is much more nettlesome within
: SolrIndexSearcher than within one of the request handlers.  One problem is the

really?  why should it be?

anything the request handler can do to much with the Query object 
SolrIndexSearcher can do as well .. and by the time 
getDocListNC/getDocListAndSetNC are called the "pure negative" issues are 
alrady resolved.

The only difference is that in those methods we already have a DocSet 
(instead of a Query) but it should be easy to wrap a DocSet in a Query to 
add to the main query.

: ISTM then that the main challenge is in determining when the filter
: intersection should be embedded.  Also, the ability to control filter caching
: is still difficult with this implementation, but perhaps that's less
: important.

yeah ... it seems like there are two orthoginal use cases...
  1) "here is an 'fq', i know it's not worth caching" ... in which we 
     don't put it in the filterCache. 
  2) "here is an 'fq'" ... in which we get the DocSet and add it to the 
     main query if it's small. 

for any given input, 1 and 2 might both apply, or just 1, or just 2



-Hoss


Re: Intuition check

Posted by Mike Klaas <mi...@gmail.com>.
On 8-Nov-07, at 8:59 AM, Chris Hostetter wrote:

> Let's back up a second...
>
> the theory is that while it's frequently handy to cache fq's  
> independent
> of the main query (because they are probably used over and over) in  
> some
> cases it may be advantageous to use an FQ directly in the body of hte
> main query to get better skipTo behavior. -- the fundemental issue is
> orthogonal to wether or not a DocSet for the FQ is cached, the  
> question
> is how should that FQ be used when computing the final DocList.
>
> So what if instead of letting the client say "this argument is an  
> fq which
> should be used to generate a BitSet and cached as a filter, this  
> argument
> is an fq.nocache which should be added to the main query" we  
> instead make
> SolrIndexSearcher smart enough to say "i've been asked to filter the
> main query using some DocSets, the intersection of those DocSets is  
> small
> enough, that instead of filtering the query on it, i'm going to add a
> query that only matches docs in it to the main query to improve skipTo
> behavior." ... so now clients don't have to know, they just pass in a
> bunch of fq params.   we still cache a DocSet for each one, but  
> when it
> comes time to do the search, we get the skipTo benefit anytime the
> intersection of all fqs is really small (wether the individual fqs are
> small enough individually or not)

I agree that this would be awesome if it can be pulled off.

> that should just be a simple change to getDocListNC right?

Let's think about this: To effectively do what you suggest, the query  
handling needs to

1. determine whether a given (set of) filter(s) would be effective in  
a skipTo context
2. embed the filter in the query as a scorer

I see difficulties with both, but perhaps they are not unsurmountable.

First, how to determine whether the filter-embedding would be  
effective?  We have at our disposal the size of the filter- 
intersection, assuming they are cached.  The most important criterion  
here is probably the relative size difference of the result set with  
the filter applied or not, which isn't really available.  It can be  
estimated assuming the filter and query are independent, but this  
definitely isn't always true.  If the filter isn't/shouldn't be  
cached?   You have to compute it separately for this (avoiding that  
is part of the goal).

Second, embedding the filter itself.  This is much more nettlesome  
within SolrIndexSearcher than within one of the request handlers.   
One problem is the use of BooleanScorer--I suppose we could detect  
that by walking the query tree looking for it.  Another is the  
embedding location: if filters are embedded in SIS, then then only  
reasonable option is to wrap everything in another top-level  
BooleanScorer with the original and filter query as required clauses  
(perhaps the filter would be inserted as prohibited if the inverse  
bitset was sufficiently sparse).  This means that the next()'s that  
happen to occur on the original query will pull in lots of extra  
scoring that might not be needed: bq's, bf's, pf's, whatever else is  
layered on the scoring (in my case, there are be 1-2 layers of  
multiplicative boosts as well).  It is nice to insert the filters  
directly into the "matching" part of the query.

Actually, nevermind: ReqOptSumScorer does not pull ahead its optional  
scorers until .score() is called, so the effects should be largely  
the same.

ISTM then that the main challenge is in determining when the filter  
intersection should be embedded.  Also, the ability to control filter  
caching is still difficult with this implementation, but perhaps  
that's less important.

Thanks for the feedback,
-Mike

Re: Intuition check

Posted by Chris Hostetter <ho...@fucit.org>.
: > Is there any reason to think that this wouldn't be the most efficient
: > solution?  Obviously the clauses would be inserted with 0 boost.
: 
: I was going to suggest that the client could always just add the extra
: clause themselves, but then I thought about dismax...

there' a broader issue as well ... people might want the "q" to come from 
one place (ie: user text box input) while the "fq.nocache" comes from 
somewhere else (ie: default configs, radio button option, etc...).

Let's back up a second...

the theory is that while it's frequently handy to cache fq's independent 
of the main query (because they are probably used over and over) in some 
cases it may be advantageous to use an FQ directly in the body of hte 
main query to get better skipTo behavior. -- the fundemental issue is 
orthogonal to wether or not a DocSet for the FQ is cached, the question 
is how should that FQ be used when computing the final DocList.

So what if instead of letting the client say "this argument is an fq which 
should be used to generate a BitSet and cached as a filter, this argument 
is an fq.nocache which should be added to the main query" we instead make 
SolrIndexSearcher smart enough to say "i've been asked to filter the 
main query using some DocSets, the intersection of those DocSets is small 
enough, that instead of filtering the query on it, i'm going to add a 
query that only matches docs in it to the main query to improve skipTo 
behavior." ... so now clients don't have to know, they just pass in a 
bunch of fq params.   we still cache a DocSet for each one, but when it 
comes time to do the search, we get the skipTo benefit anytime the 
intersection of all fqs is really small (wether the individual fqs are 
small enough individually or not)


that should just be a simple change to getDocListNC right?



-Hoss


Re: Intuition check

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 7, 2007 4:10 PM, Mike Klaas <mi...@gmail.com> wrote:
> Not sure if this went through--I sent using the wrong email addr.
>
> I've also posted a patch:
>   [ https://issues.apache.org/jira/browse/SOLR-407?
> page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> ----------------------------------------------------------------------
>
> I'm thinking of implementing something like "fq.nocache": a query
> restriction that would not be cached in filterCache.
>
> This is actually somewhat tricky to implement in the Guts of
> SolrIndexSearcher without adding a lot to the interface.  Then I
> realized that without caching, there really is no point in computing
> separate BitSets and intersecting them: it would be better to insert
> them directly into the BooleanQuery (or a wrapping BooleanQuery).
> This would also provide nicer skipTo behaviour (perhaps most docs
> needn't be scored before applying the filter).

Right.  This can speed things up if the filter matches few documents.

> Is there any reason to think that this wouldn't be the most efficient
> solution?  Obviously the clauses would be inserted with 0 boost.

I was going to suggest that the client could always just add the extra
clause themselves, but then I thought about dismax...

Rather than fq.nocache, how about decorating the existing fq via localParams:

fq=<!cache=false>

We already look for localParams anyway for query type info... so you
could also have something like:

fq=<!prefix f=myfield cache=false>foo

-Yonik