You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Jochen Frey <lu...@quontis.com> on 2004/03/25 18:47:38 UTC

Query optimizer - Cost of Queries

Hi There!

We are in the process of building a query optimizer for Lucene RangeQueries
(we need that because we run fairly complex Range queries with a few hundred
terms against large corpuses, and response time needs improvement). We have
written a framework that allows for traversing queries and rearranging /
recreating subqueries.

In a next step, we tried to find criteria to optimize. A Simple one is to
reduce the total number of terms in the query. 

Question 1: Is it a good idea to minimize the # of terms.


Some optimization options however leave the choice of which term to reduce.
In order to make that choice we are using a fairly simple cost estimator for
queries and terms (currently we only deal with SpanNearQuery, SpanOrQuery
and SpanTermQuery)

SpanNearQuery: 10 - #of clauses + total of the cost of all clauses
SpanOrQuery: 10 + total of the cost of all clauses 
SpanTermQuery: 1 over #of characters in the term 

Question 2: Does anyone have better cost estimates or comments about this?


This optimization is all happening client side (i.e. as of the writing of
this, the optimizer does not know the statistics for tokens actually stored
in the index). 

Question 3: How do I get access to Term frequencies (i.e. the number of
times a given Term appears in the index). I assume that the way to go is
getTermFreqVectors in IndexWriter. This should allow for better choices as
to which term to eliminate.

Question 4: What are good cost estimates assuming that we have term
frequencies available?


And yes, if all of this ends up working we'll make the code available to the
project.

Cheers,
	Jochen



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Query optimizer - Cost of Queries

Posted by Ype Kingma <yk...@xs4all.nl>.

Jochen,

I'm afraid I didn't understand your post fully. Nevertheless,
did you consider adding prefix terms (in a separate field) as normal terms
to your index?
Eg. suppose your terms are nrs ranging 0000 to 9999 you
could search the range 0250-0302 by prefixes indexed as terms:
025 026 027 028 029 0300 0301 0302
instead of all 53 terms separately, probably saving quite a few
disk head seeks for the range query.

How are the ranges and the spans related?

Kind regards,
Ype

On Thursday 25 March 2004 18:47, Jochen Frey wrote:
> Hi There!
>
> We are in the process of building a query optimizer for Lucene RangeQueries
> (we need that because we run fairly complex Range queries with a few
> hundred terms against large corpuses, and response time needs improvement).
> We have written a framework that allows for traversing queries and
> rearranging / recreating subqueries.
>
> In a next step, we tried to find criteria to optimize. A Simple one is to
> reduce the total number of terms in the query.
>
> Question 1: Is it a good idea to minimize the # of terms.
>
> Some optimization options however leave the choice of which term to reduce.
> In order to make that choice we are using a fairly simple cost estimator
> for queries and terms (currently we only deal with SpanNearQuery,
> SpanOrQuery and SpanTermQuery)
>
> SpanNearQuery: 10 - #of clauses + total of the cost of all clauses
> SpanOrQuery: 10 + total of the cost of all clauses
> SpanTermQuery: 1 over #of characters in the term
>
> Question 2: Does anyone have better cost estimates or comments about this?
>
> This optimization is all happening client side (i.e. as of the writing of
> this, the optimizer does not know the statistics for tokens actually stored
> in the index).
>
> Question 3: How do I get access to Term frequencies (i.e. the number of
> times a given Term appears in the index). I assume that the way to go is
> getTermFreqVectors in IndexWriter. This should allow for better choices as
> to which term to eliminate.
>
> Question 4: What are good cost estimates assuming that we have term
> frequencies available?
>



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Query optimizer - Cost of Queries

Posted by Doug Cutting <cu...@apache.org>.

Jochen Frey wrote:
> I believe TermDocs(t).freq()) gives me the number of documents in which the
> term appears, as opposed to the total number of times the term appears. Any
> particular reason for choosing that metric (it seems less accurate, but
> maybe it's the only one that can be easily retrieved).

That's correct.  This is computed because it is required for IDF 
weighting, the most common a priori term weighting technique.

> As for the costs mentioned below. I'd assume that they be different from
> machine to machine.

They probably vary somewhat, but are probably fairly proportional.  If 
one factor is 100 times another on one machine, it will still probably 
be considerably larger than the other in other situations.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

RE: Query optimizer - Cost of Queries

Posted by Jochen Frey <lu...@quontis.com>.

Doug,

I believe TermDocs(t).freq()) gives me the number of documents in which the
term appears, as opposed to the total number of times the term appears. Any
particular reason for choosing that metric (it seems less accurate, but
maybe it's the only one that can be easily retrieved).

---

As for the costs mentioned below. I'd assume that they be different from
machine to machine. If currently we don't have the time for determining the
exact costs, do you have assumptions to start with (or magnitudes, for
example  k_tq >> j_tq etc.)? If not, we might go ahead with setting all the
constants to 1, as a starting point.

> The general cost of executing a TermQuery is something like:
> 
>     k_tq + j_tq * IndexReader.docFreq(t)
> 
> Where t is the term, and k_tq and j_tq are constants that need to be
> determined.  k_tq mostly represents the cost of looking the term up in
> the dictionary, and j_tq represents the cost of scoring each document.
> 
> The cost of a BooleanQuery is something like
> 
>     sum ( cost(c) +  k_bq )
>      c

> A worst-case cost function for SpanNearQuery is something like:
> 
>    sum (k_snq * cost(c) * log(|q|) )
>     c
> 
> where |q| is the number of terms in the query.  Things should be better
> than this, however, if the skipTo() optimization fires.  This is
> effective when one or more of the terms occurs less than 1/16th as many
> documents as another, and can provide up-to a 16-fold speedup.
> 
> > SpanOrQuery: 10 + total of the cost of all clauses
> 
>    sum ( k_soq * cost(c) * log(|q|) )
>     c
> 
> But here the skipTo() optimization is not effective.
> 
> > SpanTermQuery: 1 over #of characters in the term
> 
>     k_stq + j_stq * sum (TermDocs(t).freq())
>                      d


As always, thanks for your input,
Jochen






---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

Re: Query optimizer - Cost of Queries

Posted by Doug Cutting <cu...@apache.org>.

Jochen Frey wrote:
> We are in the process of building a query optimizer for Lucene RangeQueries
> (we need that because we run fairly complex Range queries with a few hundred
> terms against large corpuses, and response time needs improvement). We have
> written a framework that allows for traversing queries and rearranging /
> recreating subqueries.
> 
> In a next step, we tried to find criteria to optimize. A Simple one is to
> reduce the total number of terms in the query. 
> 
> Question 1: Is it a good idea to minimize the # of terms.

Yes.  A RangeQuery expands into a BooleanQuery of TermQuerys.

The general cost of executing a TermQuery is something like:

    k_tq + j_tq * IndexReader.docFreq(t)

Where t is the term, and k_tq and j_tq are constants that need to be 
determined.  k_tq mostly represents the cost of looking the term up in 
the dictionary, and j_tq represents the cost of scoring each document.

The cost of a BooleanQuery is something like

    sum ( cost(c) +  k_bq )
     c

Where c is each clause in the query and k_bq is a constant.

You could perform some timing experiments to determine approximate 
values of these constants.

> Some optimization options however leave the choice of which term to reduce.
> In order to make that choice we are using a fairly simple cost estimator for
> queries and terms (currently we only deal with SpanNearQuery, SpanOrQuery
> and SpanTermQuery)
> 
> SpanNearQuery: 10 - #of clauses + total of the cost of all clauses

A worst-case cost function for SpanNearQuery is something like:

   sum (k_snq * cost(c) * log(|q|) )
    c

where |q| is the number of terms in the query.  Things should be better 
than this, however, if the skipTo() optimization fires.  This is 
effective when one or more of the terms occurs less than 1/16th as many 
documents as another, and can provide up-to a 16-fold speedup.

> SpanOrQuery: 10 + total of the cost of all clauses 

   sum ( k_soq * cost(c) * log(|q|) )
    c

But here the skipTo() optimization is not effective.

> SpanTermQuery: 1 over #of characters in the term 

    k_stq + j_stq * sum (TermDocs(t).freq())
                     d

> Question 3: How do I get access to Term frequencies (i.e. the number of
> times a given Term appears in the index).

IndexReader.docFreq(t)

> Question 4: What are good cost estimates assuming that we have term
> frequencies available?

See above.

> And yes, if all of this ends up working we'll make the code available to the
> project.

Great!  I look forward to seeing it!

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org