You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by "Russell M. Allen" <Ru...@aebn.net> on 2006/07/14 20:59:08 UTC

PrefixQuery rewrite() bug, ignores max clause count

We recently ran into an issue while executing a simple prefix search
"name:b*", which results in a BooleanQuery$TooManyClauses exception.  At
first I found it odd that a single clause query was causing this, but as
I dug into the code I found where the PrefixQuery rewrites itself as a
BooleanQuery.  Unfortunately, it doesn't respect the maxClauseCount of
the BooleanQuery in the process.  Thus, when we hit a sufficiently large
number of results, this causes the TooManyClauses exception that a
number of people have posted about.
 
Here is the code now (1.9.1):

	public class PrefixQuery extends Query {
	...
	  public Query rewrite(IndexReader reader) throws IOException {
	    BooleanQuery query = new BooleanQuery(true);
	    TermEnum enumerator = reader.terms(prefix);
	    try {
	      String prefixText = prefix.text();
	      String prefixField = prefix.field();
	      do {
	        Term term = enumerator.term();
	        if (term != null &&
	            term.text().startsWith(prefixText) &&
	            term.field() == prefixField) {
	          TermQuery tq = new TermQuery(term);   // found a match
	          tq.setBoost(getBoost());                // set the
boost
	          query.add(tq, BooleanClause.Occur.SHOULD);    // add
to query
	          //System.out.println("added " + term);
	        } else {
	          break;
	        }
	      } while (enumerator.next());
	    } finally {
	      enumerator.close();
	    }
	    return query;
	  }
	...
	}

I am no expert, but I suspect all that is needed is to watch for the max
clause count and then 'chunk' the boolean query.  I think the following
should work (changes in blue):
 
    BooleanQuery query= new BooleanQuery(true);
    TermEnum enumerator = reader.terms(prefix);
    try {
      String prefixText = prefix.text();
      String prefixField = prefix.field();
      int count = 0;
      do {
        Term term = enumerator.term();
        if (term != null &&
              term.text().startsWith(prefixText) &&
              term.field() == prefixField) {
          count++;
          TermQuery tq = new TermQuery(term);   // found a match
          tq.setBoost(getBoost());                // set the boost
          if (count >= query.getMaxClauseCount()) {
            BooleanQuery subQuery = query;
            query = new BooleanQuery(true);
            query.add(subQuery, BooleanClause.Occur.SHOULD);
            count = 1;  //reset count to 1 (the sub query)
          }
          query.add(tq, BooleanClause.Occur.SHOULD);    // add to query
          //System.out.println("added " + term);
        } else {
          break;
        }
      } while (enumerator.next());
    } finally {
      enumerator.close();
    }
    return query;
  }
 
 
 
Thanks,
Russell Allen

RE: PrefixQuery rewrite() bug, ignores max clause count

Posted by Chris Hostetter <ho...@fucit.org>.

: I understand the concern for resources, but by throwing an exception you
: have also preempted a legitimate query.  How then should one go about
: performing a query for all document where a field begins with char X?

that's the rub .. the "legitimacy" of the query is based on the index it
is executed against, and the maxClauseCount configured for that app .. a
prefix query for b* may be perfectly legitimate in a small index, but very
very bad in another ... that's what that exception is warning you about:
for your index, and your current maxClauseCount configuration, that query
is not legal.

: IMHO, both your concern for resources and the ability to query for
: "name:b*" are legitimate.  Perhaps PrefixeQuery should be able to
: gracefully degrade when it hits resource limits.  I admit I am at the

this has been discussed before, and the crux of the issue is that by
throwing this exception, PrefixQuery (or any other query that rewrites to
a BooleanQUery) is giving you the information you need to make your app
degrade in whatever way you want (gracefull or otherwise).  In apps like
Solr, RangeQueries are by based completely to prevent this problem,
PrefixQueries could likewise be avoided ... i believe Nutch may acutally
have something like what you describe (where if a query cases a toomancy
clauses exception, it rewrites that clause into a filter)

: limit of my Lucene knowledge here, but I assume that rewrite() is an
: optimization of the query and can be skipped (with some additional code)
: at the cost of performance.

nope.  In Lucene, there are two types of queries: primative, and
non-primative.  Primative queries are query classes that can create
Weights/Scorers to execute them, non-primate queries are just containers
that make it easy to model a particular type of search, but in the context
of particular index, "rewrite" themselves into primative queries.
(primative queries rewrite to themselves).  a PrefixQuery *must* rewrite
itself into a BooleanQuery containing one TermQuery for each term in the
index matching that prefix.

: We do not rank or score results at all (Lucene is a high speed index for
: us).  As a result, I am blissfully ignorant of scoring results.  Out of
: curiosity though, does the depth of a term in a query tree affect its
: score?

in a nut shell: yes.  not becuse "depth" really has any meaning to how
scoring works, but because the score produced by a boolean query when some
of it's clauses match depends on the number of clauses, so a boolean query
contaiing 10 terms scores differnetly then a boolean query cotaining 5
boolean queries each containing 2 term queries.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: PrefixQuery rewrite() bug, ignores max clause count

Posted by "Russell M. Allen" <Ru...@aebn.net>.

: 1) colorzing your mail doesn't play nicely with the mailing list ...
: sending "diffs" is the prefered way to show potential changes to the
code.

So I discovered on sending the message.  :)

: 2) assuming i understand what it is you've changed, you've "worked
arround" the TooManyClausesException in a somewhat complicated way (you
could just as : easily increase the maxClauseCount and save yourslef
some
: headache) ... 

Actually, we have increased the maxClauseCount in our production code to
address this problem.  (We don't want to maintain a custom version of
Lucene.)

: by doing this you bypass the whole point of having a
: maClauseCount: to prevent a query that will consume too many resources
(namely RAM and query execution time) from being constructed.

I understand the concern for resources, but by throwing an exception you
have also preempted a legitimate query.  How then should one go about
performing a query for all document where a field begins with char X?
We have 60,000+ documents, of which B, S and T seem to be the most
common initial char (just over 7000 document each).  We are building an
'index' with the search results, thus the need for a query like:
"name:b*"

IMHO, both your concern for resources and the ability to query for
"name:b*" are legitimate.  Perhaps PrefixeQuery should be able to
gracefully degrade when it hits resource limits.  I admit I am at the
limit of my Lucene knowledge here, but I assume that rewrite() is an
optimization of the query and can be skipped (with some additional code)
at the cost of performance. 

: 3) your change modifies the core impacts from each of the Terms that
match the prefix ... if you don't care about the score impacts from the
terms, then : there are other options.  Solr has a PrefixFilter class
which can be wrapped in a ConstantScoreQuery to support prefix queries
regardless of your term : : distribution.

We do not rank or score results at all (Lucene is a high speed index for
us).  As a result, I am blissfully ignorant of scoring results.  Out of
curiosity though, does the depth of a term in a query tree affect its
score?

I will take a look at PrefixFilter and ConstantScoreQuery.

Thanks!
Russell

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: PrefixQuery rewrite() bug, ignores max clause count

Posted by Chris Hostetter <ho...@fucit.org>.

: I dug into the code I found where the PrefixQuery rewrites itself as a
: BooleanQuery.  Unfortunately, it doesn't respect the maxClauseCount of
: the BooleanQuery in the process.  Thus, when we hit a sufficiently large
: number of results, this causes the TooManyClauses exception that a
: number of people have posted about.

this is the expected behavior of a PrefixQuery ... the
TooManyClausesException is explicitly thrown to ensure that a query
consuming too many resources isn't built, and your App is free to do
whatever you want in that situation.

: I am no expert, but I suspect all that is needed is to watch for the max
: clause count and then 'chunk' the boolean query.  I think the following
: should work (changes in blue):

1) colorzing your mail doesn't play nicely with the mailing list ...
sending "diffs" is the prefered way to show potential changes to the code.

2) assuming i understand what it is you've changed, you've "worked
arround" the TooManyClausesException in a somewhat complicated way (you
could just as easily increase the maxClauseCount and save yourslef some
headache) ... by doing this you bypass the whole point of having a
maClauseCount: to prevent a query that will consume too many resources
(namely RAM and query execution time) from being constructed.

3) your change modifies the core impacts from each of the Terms that match
the prefix ... if you don't care about the score impacts from the terms,
then there are other options.  Solr has a PrefixFilter class which can be
wrapped in a ConstantScoreQuery to support prefix queries regardless of
your term distribution.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org