Posted to dev@lucene.apache.org by Giulio Cesare Solaroli <gi...@gmail.com> on 2004/11/12 15:11:26 UTC

Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

Hi all,

I am also cross-posting my reply to the developer list because I think some of
my arguments belong there.

I was thinking about somehow extending the PhraseQuery analyzer in
order to better handle wildcard expansion.

Sanyi's idea to "optimize" the expansion of the terms to include just the ones
meaningful for the subset of documents found by the other parts of the
query is intriguing, but probably very difficult to implement.

My idea would probably be easier to implement; even if the final result
is not 100% exact, it could be good enough. The idea is to let the
developer handle the boolean query limit in the following way:
- leave the current implementation, raising an exception;
- handle the exception and limit the boolean query to the first 1024
(or whatever the limit is) terms;
- select, among the possible terms, only the 1024 (or whatever the
limit is) most meaningful ones, leaving out all the others.

I had this idea while watching how some terms were expanded against our
index. Many of them were clearly wrong words, filenames, or other kinds
of irrelevant information that were not easy to remove before indexing.

This solution changes the returned results in a subtle way (although only
in cases where the current implementation throws an exception), so the
developer should be very careful to tell her users that the query
may have left out some documents.

As a first approximation, how "meaningful" a term is in this context
could be proportional to the number of documents in the whole index
that contain that term.
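
A minimal sketch of that selection step, assuming the candidate terms and
their document frequencies have already been collected into a map. The
class and method names here are hypothetical and are not part of Lucene;
in Lucene the counts could presumably be obtained with
IndexReader.docFreq(Term).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class.
class TopTermSelector {

    /**
     * Returns at most maxClauses terms, keeping the ones that occur in the
     * most documents. docFreqs maps each expanded term to the number of
     * documents that contain it (assumed input).
     */
    static List<String> selectTopTerms(Map<String, Integer> docFreqs, int maxClauses) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreqs.entrySet());

        // Highest document frequency first; ties are broken alphabetically
        // so that the result is deterministic.
        entries.sort((a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        });

        int keep = Math.min(Math.max(maxClauses, 0), entries.size());
        List<String> kept = new ArrayList<>(keep);
        for (int i = 0; i < keep; i++) {
            kept.add(entries.get(i).getKey());
        }
        return kept;
    }
}
```

With frequencies such as cable=120, cables=80, cab.tmp=2, cabxyz=1 and a
limit of 2, this keeps "cable" and "cables" and drops the two rare terms,
which is the kind of noise described above. The caller would still have to
tell the user that the expansion was truncated.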

Does this idea sound interesting to any of you?

Regards,

Giulio Cesare Solaroli



On Thu, 11 Nov 2004 11:57:32 -0800 (PST), Sanyi <ne...@yahoo.com> wrote:
> Yes, I understand all of this, but I don't want to set it to MaxInt, since it can easily lead to
> (even accidental) DoS attacks.
> 
> What I'm saying is that there is no reason for the optimizer to expand wild* to more than 1024
> variations when I search for "somerareword AND wild*", since somerareword is only present in let's
> say 100 documents, so wild* should only expand to words beginning with "wild" in those 100
> documents, then it should work fine with the default 1024 clause limit.
> 
> But it doesn't, so I can choose between unusable queries or accidental DoS attacks.
> 
> 
> 
> --- Will Allen <wa...@Cyveillance.com> wrote:
> 
> > Any wildcard search will automatically expand your query to the number of terms it finds in the
> > index that match the wildcard.
> >
> > For example:
> >
> > wild*, would become wild OR wilderness OR wildman etc for each of the terms that exist in your
> > index.
> >
> > It is because of this, that you quickly reach the 1024 limit of clauses.  I automatically set it
> > to max int with the following line:
> >
> > BooleanQuery.setMaxClauseCount( Integer.MAX_VALUE );
> >
> >
> > -----Original Message-----
> > From: Sanyi [mailto:need4sid@yahoo.com]
> > Sent: Thursday, November 11, 2004 6:46 AM
> > To: lucene-user@jakarta.apache.org
> > Subject: Bug in the BooleanQuery optimizer? ..TooManyClauses
> >
> >
> > Hi!
> >
> > First of all, I've read about BooleanQuery$TooManyClauses, so I know that it has a 1024-clause
> > limit by default, which is good enough for me, but I still think it behaves strangely.
> >
> > Example:
> > I have an index with about 20 million documents.
> > Let's say that there are about 3000 variants of this word mask in the entire document set: cab*
> > Let's say that about 500 documents contain the word: spectrum
> > Now, when I search for "cab* AND spectrum", I don't expect it to throw an exception.
> > It should first restrict the search to the 500 documents containing the word "spectrum", then
> > it should collect the variants of "cab*" within these documents, which turns out to be two or
> > three variants of "cab*" (cable, cables, maybe some more), and the search should return, let's
> > say, 10 documents.
> >
> > Similar example: when I search for "cab* AND nonexistingword", it still throws a TooManyClauses
> > exception instead of saying "No results". Since there is no "nonexistingword" in my document
> > set, it doesn't even have to start collecting the variations of "cab*".
> >
> > Is there any patch for this issue?
> > Thank you for your time!
> >
> > Sanyi
> > (I'm using: lucene 1.4.2)
> >
> > P.S.: Sorry for re-sending this message; I first sent it accidentally as a reply to the
> > wrong thread.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail: lucene-user-help@jakarta.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

Posted by Daniel Naber <da...@t-online.de>.
On Friday 12 November 2004 15:11, Giulio Cesare Solaroli wrote:

> - select, between the possible terms, only the first 1024 (or what
> ever the limit is) more meaningful ones, leaving out all the others.

This is, BTW, what FuzzyQuery in CVS HEAD does now. For FuzzyQuery, 
however, it's easier to find the "best" expansion terms.

I just noticed that this change is missing from the changelog. Christoph, 
could you add it?

Regards
 Daniel

-- 
http://www.danielnaber.de



Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

Posted by Paul Elschot <pa...@xs4all.nl>.
On Saturday 13 November 2004 09:16, Sanyi wrote:
> > - leave the current implementation, raising an exception;
> > - handle the exception and limit the boolean query to the first 1024
> > (or what ever the limit is) terms;
> > - select, between the possible terms, only the first 1024 (or what
> > ever the limit is) more meaningful ones, leaving out all the others.
> 
> I like this idea and I would finalize to myself like this:
> I'd also create a default rule for that to avoid handling exceptions for
> people who're happy with the default behavior:
> 
> Keep and search for only the longest 1024 fragments, so it'll throw
> a,an,at,and,add,etc.., but it'll automatically keep 1024 variations like
> alpha,alfa,advanced,automatical,etc..

Wouldn't it be counterintuitive to only use the longest matches
for truncations?
To have only longer matches one can also use queries with
multiple ? characters, each matching exactly one character.

I think it would be better to encourage users to use longer,
and maybe also more, prefixes. This gives more precise results
and is more efficient to execute.
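
To illustrate the point about ? characters, here is a small sketch of
wildcard matching in which * matches any run of characters (including
none) and each ? matches exactly one character. It is a stand-alone
approximation built on java.util.regex, with hypothetical names; it is not
how Lucene's WildcardQuery is implemented.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Lucene class.
class WildcardSketch {

    /** Translates a wildcard expression into an equivalent regular expression. */
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");          // zero or more characters
            } else if (c == '?') {
                regex.append('.');           // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));  // literal character
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** Returns true if the whole term matches the wildcard expression. */
    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }
}
```

Under these rules "cab??" matches "cable" but neither "cab" nor "cables",
while "cab??*" matches any term that starts with "cab" and has at least
five characters, so short matches can be excluded in the query itself.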

Regards,
Paul Elschot




Re: Bug in the BooleanQuery optimizer? ..TooManyClauses

Posted by Sanyi <ne...@yahoo.com>.
> - leave the current implementation, raising an exception;
> - handle the exception and limit the boolean query to the first 1024
> (or what ever the limit is) terms;
> - select, between the possible terms, only the first 1024 (or what
> ever the limit is) more meaningful ones, leaving out all the others.

I like this idea, and I would refine it like this:
I'd also create a default rule so that people who are happy with the default behavior don't
have to handle exceptions:

Keep and search for only the longest 1024 fragments, so it will throw away a, an, at, and, add,
etc., but it will automatically keep 1024 variations like alpha, alfa, advanced, automatical, etc.
This automatically lowers the search overhead, and the search still works without throwing
exceptions.
(For people who prefer the widest search range and do not care about the huge overhead, we could
provide a boolean switch for keeping the shortest fragments instead of the longest.)
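
A rough sketch of this default rule, assuming the expanded terms are
already available as a list. The class name, method name, and the
keepLongest flag are hypothetical and only illustrate the proposal; they
are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not a Lucene class.
class LengthBasedTermLimiter {

    /**
     * Returns at most maxClauses terms. With keepLongest set, the longest
     * terms are kept; otherwise the shortest ones are kept.
     */
    static List<String> limit(List<String> terms, int maxClauses, boolean keepLongest) {
        Comparator<String> byLength = Comparator.comparingInt(String::length);
        if (keepLongest) {
            byLength = byLength.reversed();
        }

        // Sort a copy so the caller's list is left untouched; terms of equal
        // length are ordered alphabetically to keep the result deterministic.
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(byLength.thenComparing(Comparator.<String>naturalOrder()));

        int keep = Math.min(Math.max(maxClauses, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, keep));
    }
}
```

For the terms a, an, at, and, add, alpha, alfa, advanced, automatical and
a limit of 3, keeping the longest yields automatical, advanced, alpha,
while the switch set to false yields a, an, at.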


		