Posted to dev@lucene.apache.org by mark harwood <ma...@yahoo.co.uk> on 2006/02/07 19:28:24 UTC

Preventing "killer" queries

I've just been doing some benchmarking on a reasonably
large-scale system (38 million docs) and ran into an
issue where certain *very* common terms would
dramatically slow query responses.
Some terms were abnormally common because I had
constructed the index by taking several copies of a
small sample data set and merging them, so the address
data had the county name reproduced massively.
Consequently a TermQuery for the county name (with a 50%
docFreq) in the scaled-up 38m doc index took 2 seconds
to return, whereas most "normal" terms (<10% df) took a
matter of milliseconds.

Of course the solution for most situations is to use a
stop-word list at index time, but that requires some
manual configuration and prior knowledge of the data,
which isn't always ideal.

For these outlier situations, is it worth adding a
"maxDf" property to TermQuery, akin to BooleanQuery's
maxClauseCount query-time control? I could fix my
problem in my own app-specific query construction code,
but I wonder whether others would find it a useful fix
to add to TermQuery in the Lucene core?
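
Something like the following sketch is the sort of
app-specific check I mean -- the helper class and maxDf
threshold here are made up for illustration, not proposed
API:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class DfGuardedQueryFactory {
        // Drop terms whose document frequency exceeds a chosen
        // ceiling, mimicking a stop word; callers skip null clauses.
        public static Query makeTermQuery(IndexReader reader, Term term,
                                          int maxDf) throws IOException {
            if (reader.docFreq(term) > maxDf) {
                return null;
            }
            return new TermQuery(term);
        }
    }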


Cheers,
Mark






		


Re: Preventing "killer" queries

Posted by Chris Hostetter <ho...@fucit.org>.
Mark,

I know you've already committed a patch along these lines (LUCENE-494) and
I can see how in a lot of cases that would be a great solution, but I'm
still interested in the original idea you proposed (a 'maxDf' in
TermQuery) because I anticipate situations in which you don't want to
ignore the common term at query time (because you want it to affect the
result set); you just don't want to spend a lot of time calculating its
score contribution -- or perhaps you don't want its contribution included
at all because it's so common, even if an optimization can get the time
down.

If I understand your description of the problem, your profiling has
confirmed that when a term is extremely common, the "tf" portion of the
score calculation for each doc is expensive because of the underlying call
to TermDocs.read(int[],int[]) ... is that correct?

If that's the case, then it seems like a fairly straightforward and useful
patch would be to add the following (untested) to TermQuery...

    private static int maxDocFreq = Integer.MAX_VALUE;
    private static float maxDocFreqRawScore = 0.0f;

    // note: statics make this a JVM-wide setting rather than per-query
    public static void setMaxDocFreqScore(int df, float rawScore) {
        maxDocFreq = df;
        maxDocFreqRawScore = rawScore;
    }

    public Query rewrite(IndexReader reader) throws IOException {
        if (maxDocFreq < reader.docFreq(term)) {
            // should be a ConstantScoreTermQuery, but that doesn't exist
            Query q = new ConstantScoreRangeQuery(term.field(), term.text(),
                                                  term.text(), true, true);
            q.setBoost(maxDocFreqRawScore);
            return q.rewrite(reader);
        }
        return this;
    }


...the downside compared to your existing approach is that it still
spends some time on the really common terms (building up the filter), so
if you truly want to ignore them the analyzer is a better way to go --
but the upside is that it would still allow those really common terms to
affect the result set.


   thoughts?



: Date: Tue, 07 Feb 2006 20:18:27 +0000
: From: markharw00d <ma...@yahoo.co.uk>
: Reply-To: java-dev@lucene.apache.org
: To: java-dev@lucene.apache.org
: Subject: Re: Preventing "killer" queries
:
: [Answering my own question]
:
: I think a reasonable solution is to have a generic analyzer for use at
: query-time that can wrap my application's choice of analyzer and
: automatically filter out what it sees as stop words. It would initialize
: itself from an IndexReader and create a StopFilter for those terms
: whose document frequency exceeds a given threshold.
:
: This approach seems reasonable because:
: a) The stop word filter is automatically adaptive and doesn't need
: manual tuning.
: b) I can live with the disk space overhead of the few "killer" terms
: that will make it into the index.
: c) "Silent" failure (i.e. removal of terms from the query) is probably
: generally preferable to the throw-an-exception approach taken by
: BooleanQuery when clause limits are exceeded.



-Hoss




Re: Preventing "killer" queries

Posted by markharw00d <ma...@yahoo.co.uk>.
[Answering my own question]

I think a reasonable solution is to have a generic analyzer for use at
query-time that can wrap my application's choice of analyzer and
automatically filter out what it sees as stop words. It would initialize
itself from an IndexReader and create a StopFilter for those terms
whose document frequency exceeds a given threshold.

This approach seems reasonable because:
a) The stop word filter is automatically adaptive and doesn't need
manual tuning.
b) I can live with the disk space overhead of the few "killer" terms
that will make it into the index.
c) "Silent" failure (i.e. removal of terms from the query) is probably
generally preferable to the throw-an-exception approach taken by
BooleanQuery when clause limits are exceeded.
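
A minimal sketch of that analyzer idea, assuming the 1.9-era
Analyzer/TokenStream and TermEnum APIs (the class name and
maxDocFreq parameter are illustrative, not existing Lucene API):

    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class AutoStopWordAnalyzer extends Analyzer {
        private final Analyzer delegate;
        private final Set stopWords = new HashSet();

        // Scan the given field's terms once, treating anything above
        // maxDocFreq as a stop word for subsequent query analysis.
        public AutoStopWordAnalyzer(Analyzer delegate, IndexReader reader,
                                    String field, int maxDocFreq)
                throws IOException {
            this.delegate = delegate;
            TermEnum te = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = te.term();
                    if (t == null || !t.field().equals(field)) break;
                    if (te.docFreq() > maxDocFreq) {
                        stopWords.add(t.text());
                    }
                } while (te.next());
            } finally {
                te.close();
            }
        }

        // Note: applies the one stop set to every field; a fuller
        // version would track stop words per field.
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new StopFilter(delegate.tokenStream(fieldName, reader),
                                  stopWords);
        }
    }

Queries would then be analyzed with this wrapper while indexing
still uses the unwrapped analyzer, so the common terms stay in
the index (point b above) but drop out of queries.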







		


Re: Preventing "killer" queries

Posted by Doug Cutting <cu...@apache.org>.
mark harwood wrote:
> For these outlier situations, is it worth adding a
> "maxDf" property to TermQuery, akin to BooleanQuery's
> maxClauseCount query-time control? I could fix my
> problem in my own app-specific query construction code,
> but I wonder whether others would find it a useful fix
> to add to TermQuery in the Lucene core?

Another approach is to use a TopDocCollector (in 1.9 only) and
override the collect() method so that, if too much time has
elapsed, it throws an exception to stop the query with the
results found thus far.

For an example of how to extend TopDocCollector, see:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup
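
A minimal sketch of that collector, modelled loosely on the Nutch
class above and assuming the 1.9-era TopDocCollector(int numHits)
constructor (the class name, deadline field, and exception type
are illustrative):

    import org.apache.lucene.search.TopDocCollector;

    public class TimeLimitedCollector extends TopDocCollector {

        // Unchecked, so it can escape collect(), which declares no
        // checked exceptions.
        public static class TimeExceededException
                extends RuntimeException {}

        private final long deadline; // absolute wall-clock time, ms

        public TimeLimitedCollector(int numHits, long timeoutMillis) {
            super(numHits);
            this.deadline = System.currentTimeMillis() + timeoutMillis;
        }

        public void collect(int doc, float score) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeExceededException();
            }
            super.collect(doc, score); // normal top-N accumulation
        }
    }

The caller catches the exception and keeps whatever was gathered
before the deadline, e.g.:

    TimeLimitedCollector collector = new TimeLimitedCollector(100, 2000L);
    try {
        searcher.search(query, collector);
    } catch (TimeLimitedCollector.TimeExceededException e) {
        // timed out; collector.topDocs() holds the partial results
    }
    TopDocs results = collector.topDocs();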

Doug
