You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Christoph Goller <go...@detego-software.de> on 2004/10/11 10:53:14 UTC

FuzzyQuery prefix length

Last week I committed some changes to FuzzyQuery and QueryParser.
Especailly the default prefix length:

 > 1) It is now using a default prefix length of 2 in order to
 > become usable on big indices. QueryParser has been changed
 > accordingly. QueryParser allows to set the prefixLength for
 > FuzzyQueries.

changes the semantics of FuzzyQueries. I am not sure whether a
value of 2 is adequate. Maybe the default should remain 0 and
folks with big indices should decide by themselve to use a prefix.

Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> Searching for Photokopie~ on a 230,000 document corpus takes 2.3 seconds here 
> (AMD Athlon 2600+; other fuzzy terms get similar performance). As the number 
> of terms doesn't increase so fast with more documents, it will not take 10 
> seconds for 1 million documents. So fuzzy search isn't *that* slow.

How long do non-fuzzy queries take?  What is the ratio?  How about a 
query with multiple fuzzy terms?

If someone launches a service but fails to test it with fuzzy queries, 
will they be subject to inadvertant denial-of-service when a user starts 
using fuzzy queries?  Web-based search is particularly vulnerable.  If a 
query takes a few seconds and the user hits his browser's STOP and 
RELOAD buttons, the first query keeps running on the server.

This is not an imaginary problem.  I have worked with several clients 
who have run into this in deployed applications.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Daniel Naber <da...@t-online.de>.
On Wednesday 13 October 2004 00:42, Daniel Naber wrote:

> I'll try to do some performance tests with fuzzy query tomorrow on a
> 250,000 document index.

Searching for Photokopie~ on a 230,000 document corpus takes 2.3 seconds here 
(AMD Athlon 2600+; other fuzzy terms get similar performance). As the number 
of terms doesn't increase so fast with more documents, it will not take 10 
seconds for 1 million documents. So fuzzy search isn't *that* slow.

Regards
 Daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> On Tuesday 12 October 2004 17:22, Doug Cutting wrote:
>>Which is worse: a person who searches for Photokopie~ in a 1000 document
>>collection does not find documents containing Fotokopie; or a person who
>>searches for Photokopie~ in a 1M document collection doesn't find
>>anything because it takes too long.  I think some relevant results are
>>better than none.
> 
> I disagree, as the user who doesn't get the "Fotokopie" matches will not 
> understand what's going on. He will assume that there are no such 
> documents, which is wrong.

I disagree.   For someone to assume that they would need a detailed 
understanding of how "~" works.  Such a person would likely also know 
whether initial characters are considered in the operation of "~".  Most 
users who use "~" would probably use it when they're uncertain of 
spelling, without a detailed understanding of how it works, and, most of 
the time, it will help them.

> If there's a timeout the user will at least 
> notice something is wrong. Besides that, it's the developers 
> responsibility to get things fast enough.

We're talking about the appropriate default.  Defaults are used by 
unsophisticated developers.  A system deployed by an unsophisticated 
developer should not suffer from erratic timeouts.  Users using the 
standard query syntax should enjoy a reasonable experience on 
multi-million document collections without having to tweak things.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 26, 2004, at 3:33 PM, Doug Cutting wrote:
> Erik Hatcher wrote:
>> On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote:
>>> The advantages of a zero-character prefix default are that it's 
>>> back-compatibile and that it will find more matches, when spelling 
>>> differences are in the first characters.
>>>
>> I prefer this default.
>> Anyone using QueryParser needs to be aware of the issues of exposing 
>> fuzzy queries, range queries, and any other types the syntax 
>> supports.  It would not be Lucene's fault if a system with millions 
>> of documents is exposed through QueryParser and fuzzy queries take a 
>> bit longer or thrown a TooManyClauses exception.
>
> I am clearly outvoted.  I still disagree, but will not veto this.
>
> My last words on the topic (I promise!): In designing Lucene I tried 
> hard to only add features that were scalable.  For example, one could 
> easily implement a RegexQuery that scans text of stored fields, 
> returning those which match a regex.  This would provide grep-like 
> functionality, which some folks might find useful.  But it would not 
> be scalable.  If someone contributed such a thing I would lobby 
> against permitting its use from QueryParser in the default 
> configuration.  The query parser already requires an initial character 
> before a wildcard, in order to make this operator more scalable.  I 
> don't see why fuzzy queries should be treated differently, why we 
> permit such a huge scalability hole in the default configuration.

I agree completely with your sentiment.  I personally would be happy 
with QueryParser weren't part of Lucene altogether - sure did make 
writing the book much harder, thats for sure!

But with the wildcard query requiring an initial character - at least 
the results you get back would be completely accurate.  With a fuzzy 
query and a required prefix, it would not necessarily be the case, 
given the examples I've seen on here.

Perhaps for Lucene 2.0 we can gut QueryParser and have some type of 
pluggable syntax handlers, so that these inefficient queries like fuzzy 
and wildcard are not initially possible, but could be turned on 
somehow.  I personally recommend, and show how in the book, to throw 
ParseException for both wildcard and fuzzy queries.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
> On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote:
> 
>> The advantages of a zero-character prefix default are that it's 
>> back-compatibile and that it will find more matches, when spelling 
>> differences are in the first characters.
>>
> 
> I prefer this default.
> 
> Anyone using QueryParser needs to be aware of the issues of exposing 
> fuzzy queries, range queries, and any other types the syntax supports.  
> It would not be Lucene's fault if a system with millions of documents is 
> exposed through QueryParser and fuzzy queries take a bit longer or 
> thrown a TooManyClauses exception.

I am clearly outvoted.  I still disagree, but will not veto this.

My last words on the topic (I promise!): In designing Lucene I tried 
hard to only add features that were scalable.  For example, one could 
easily implement a RegexQuery that scans text of stored fields, 
returning those which match a regex.  This would provide grep-like 
functionality, which some folks might find useful.  But it would not be 
scalable.  If someone contributed such a thing I would lobby against 
permitting its use from QueryParser in the default configuration.  The 
query parser already requires an initial character before a wildcard, in 
order to make this operator more scalable.  I don't see why fuzzy 
queries should be treated differently, why we permit such a huge 
scalability hole in the default configuration.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Scott Ganyo <sc...@ganyo.com>.
I prefer this as well.  But then again I didn't agree with the 
TooManyClauses decision, either, where it was decided that the better 
good was served by protecting the user regardless of whether he or she 
wanted it.  Isn't this pretty much debating this philosophy again?

On Oct 20, 2004, at 12:56 PM, Erik Hatcher wrote:

> On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote:
>> The advantages of a zero-character prefix default are that it's 
>> back-compatibile and that it will find more matches, when spelling 
>> differences are in the first characters.
>>
>
> I prefer this default.
>
> Anyone using QueryParser needs to be aware of the issues of exposing 
> fuzzy queries, range queries, and any other types the syntax supports. 
>  It would not be Lucene's fault if a system with millions of documents 
> is exposed through QueryParser and fuzzy queries take a bit longer or 
> thrown a TooManyClauses exception.
>
> 	Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>

Re: FuzzyQuery prefix length

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 20, 2004, at 12:14 PM, Doug Cutting wrote:
> The advantages of a zero-character prefix default are that it's 
> back-compatibile and that it will find more matches, when spelling 
> differences are in the first characters.
>

I prefer this default.

Anyone using QueryParser needs to be aware of the issues of exposing 
fuzzy queries, range queries, and any other types the syntax supports.  
It would not be Lucene's fault if a system with millions of documents 
is exposed through QueryParser and fuzzy queries take a bit longer or 
thrown a TooManyClauses exception.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Bernhard Messer <Be...@intrafind.de>.
Doug Cutting wrote:

> Daniel Naber wrote:
>
>> On Tuesday 12 October 2004 17:22, Doug Cutting wrote:
>>
>>> Which is worse: a person who searches for Photokopie~ in a 1000 
>>> document
>>> collection does not find documents containing Fotokopie; or a person 
>>> who
>>> searches for Photokopie~ in a 1M document collection doesn't find
>>> anything because it takes too long.  I think some relevant results are
>>> better than none.
>>
>>
>> I disagree, as the user who doesn't get the "Fotokopie" matches will 
>> not understand what's going on. He will assume that there are no such 
>> documents, which is wrong. If there's a timeout the user will at 
>> least notice something is wrong. Besides that, it's the developers 
>> responsibility to get things fast enough. If he decides to do so with 
>> a prefix that might be okay for his use case. 
>
my personal opinion, plus the experience I've made over the last years 
in the area of information retrieval would favorite Daniel's idea to set 
the prefix length to 0 per default. My personal arguments are:

1) most of the developers using lucene, either as a basis or as an 
enhancement on their own products, will deal with an index size not 
bigger than 10.000 documents. These group of developers are happy if 
they have an API which is easy to use and does exactly what they expect. 
They don't worry about internal features and just use it, the way they 
got it. With such an index size, they will never run into a timeout or 
performance problem and they're happy to find all documents belonging to 
a fuzzy query.

2) developers handling large document collection with more than 1M docs 
will study the possibilities and options they have within lucene to 
optimize their system. They will find the knob which has to be screwed 
when running into timeouts  or memory problems. If not, they will ask 
the community to get an hint.

3) I would leave the functional behavior of lucene in future versions 
backward compatible as far as possible. It's no problem to change the 
API, making methods deprecated and and... Modern development 
environments are showing up the deprecation warnings, supporting 
developers to update the software.  But they can't support us, if the 
query results are different changing from lucene 1.4 to lucene 1.9.

Bernhard





> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> On Tuesday 12 October 2004 17:22, Doug Cutting wrote:
> 
>>Which is worse: a person who searches for Photokopie~ in a 1000 document
>>collection does not find documents containing Fotokopie; or a person who
>>searches for Photokopie~ in a 1M document collection doesn't find
>>anything because it takes too long.  I think some relevant results are
>>better than none.
> 
> I disagree, as the user who doesn't get the "Fotokopie" matches will not 
> understand what's going on. He will assume that there are no such 
> documents, which is wrong. If there's a timeout the user will at least 
> notice something is wrong. Besides that, it's the developers 
> responsibility to get things fast enough. If he decides to do so with a 
> prefix that might be okay for his use case. 

This is clearly not a black-and-white issue.  Can other Lucene 
developers please offer their opinions?

The question is whether the QueryParser should, by default, require a 
one-or-two character prefix match for fuzzy terms, or a zero-character 
prefix, as it does today.

The advantages of a zero-character prefix default are that it's 
back-compatibile and that it will find more matches, when spelling 
differences are in the first characters.

The disadvantage of a zero-character prefix default is that it performs 
poorly for large collections, requring perhaps around 10 seconds for 
multi-million document collections, considerably slower than any other 
type of query supported by the QueryParser.

Similarly, the advantage of a one-or-two-character prefix default is 
that it will perform much better with larger collections.  And the 
disadvantage is that it is an incompatible change, and it will miss some 
matches, those where the spelling differences are in the first characters.

Developers may always change this by calling 
QueryParser.setFuzzyPrefixLength().  So at issue is which behaviour is 
better for developers who do not know of this parameter.  Is it more 
important that their applications perform well or that they find all 
matches to fuzzy queries?

Please offer your opinion and thoughts on this.

Thanks,

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Daniel Naber <da...@t-online.de>.
On Tuesday 12 October 2004 17:22, Doug Cutting wrote:

> Which is worse: a person who searches for Photokopie~ in a 1000 document
> collection does not find documents containing Fotokopie; or a person who
> searches for Photokopie~ in a 1M document collection doesn't find
> anything because it takes too long.  I think some relevant results are
> better than none.

I disagree, as the user who doesn't get the "Fotokopie" matches will not 
understand what's going on. He will assume that there are no such 
documents, which is wrong. If there's a timeout the user will at least 
notice something is wrong. Besides that, it's the developers 
responsibility to get things fast enough. If he decides to do so with a 
prefix that might be okay for his use case. 

I'll try to do some performance tests with fuzzy query tomorrow on a 
250,000 document index.

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Christoph Goller <go...@detego-software.de>.
Both sides have good arguments. I am undecisive.
However, the discussion is only about the default. Whatever
we choose, developers using Lucene will stll have all options.

I briefly thought about integrating David Spencer's
ngram-spellchecker into Lucene. It could make FuzyQuery fast AND
complete. However, it would make indexing slower since Lucene would
have to update the additional index of terms and term ngrams.
Furthermore, I don't know a good solution for deleting terms from
this additional index if documents get deleted.

Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> -It is the only change so far that we cannot express in the API, i.e. we 
> cannot just deprecate a method to make Lucene's users aware of this. So we 
> can only list it in CHANGES.txt, where some people will surely miss it.

We could define a new query parser class with the new behaviour and 
deprecate the old query parser.  I am not advocating this, merely noting 
that it is possible to make this change back-compatibly.

If we agree that this change does make Lucene better (and I'm not sure 
we do) then we should make the change, no?  Back-compatibility is a good 
thing, but, with a major release, should quality suffer becaue of 
back-compatibility issues?  I hope not.  Rather we should take the 
opportunity of a major release to make Lucene as good as we can.

> -There are words in German like Photokopie/Fotokopie which have the same 
> meaning and a very similar spelling, so people will expect a FuzzyQuery to 
> match such words. But as the difference is in the first two characters it 
> won't be found with the default.
> 
> -People whose index is just 1000 documents large will probably not notice a 
> difference in speed, but they might see a difference in quality (see 
> above). Why should these people change the default instead of those with a 
> 10 mio document index?

Which is worse: a person who searches for Photokopie~ in a 1000 document 
collection does not find documents containing Fotokopie; or a person who 
searches for Photokopie~ in a 1M document collection doesn't find 
anything because it takes too long.  I think some relevant results are 
better than none.  Classes of queries which take orders of magnitude 
longer than others are a problem.

Doug



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Daniel Naber <da...@t-online.de>.
On Monday 11 October 2004 18:20, Doug Cutting wrote:

> However 2.0 is our opportunity to make incompatible changes.  What is
> the best default for this, that will work well for the most
> applications?

I see the following problems with a default > 0:

-It is the only change so far that we cannot express in the API, i.e. we 
cannot just deprecate a method to make Lucene's users aware of this. So we 
can only list it in CHANGES.txt, where some people will surely miss it.

-There are words in German like Photokopie/Fotokopie which have the same 
meaning and a very similar spelling, so people will expect a FuzzyQuery to 
match such words. But as the difference is in the first two characters it 
won't be found with the default.

-People whose index is just 1000 documents large will probably not notice a 
difference in speed, but they might see a difference in quality (see 
above). Why should these people change the default instead of those with a 
10 mio document index?

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Bernhard Messer <Be...@intrafind.de>.
Doug Cutting wrote:

> Does anyone have fuzzy-query benchmarks for, e.g., ~1M document 
> indexes, where each document contains a few k of text?  Ideally with 
> such indexes, even complex queries should take less than a second, 
> no?  How long does a fuzzy query take?  And how much does a prefix of 
> zero, one, or two change that?  Queries that take much longer than a 
> second are considerably less usable.  I think the the default should 
> provide good usability for indexes of at least 1M documents. 

i've an index containing about 800k documents, a few kb of text for each 
document. Every lucene doc in the index has about 12 fields. The overall 
index size is about 2.8 GB.

> Another thing to examine is how different the generated terms are with 
> different prefixes.  One could randomly select some words from an 
> index and compute the average amount that a prefix of one and two 
> changes the end results.  My guess is that the changes are small.  
> Since fuzzy search is a heuristic, not an exact computation, good 
> approximations are fair play.
>
If that fits you're need, i can create and run a test for query 
benchmarking.

regards
Bernhard

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> I agree that the default should stay 0, even for Lucene 2.0.

It should certainly stay zero for 1.4.x releases.

However 2.0 is our opportunity to make incompatible changes.  What is 
the best default for this, that will work well for the most applications?

Does anyone have fuzzy-query benchmarks for, e.g., ~1M document indexes, 
where each document contains a few k of text?  Ideally with such 
indexes, even complex queries should take less than a second, no?  How 
long does a fuzzy query take?  And how much does a prefix of zero, one, 
or two change that?  Queries that take much longer than a second are 
considerably less usable.  I think the the default should provide good 
usability for indexes of at least 1M documents.

Another thing to examine is how different the generated terms are with 
different prefixes.  One could randomly select some words from an index 
and compute the average amount that a prefix of one and two changes the 
end results.  My guess is that the changes are small.  Since fuzzy 
search is a heuristic, not an exact computation, good approximations are 
fair play.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: FuzzyQuery prefix length

Posted by Bernhard Messer <Be...@intrafind.de>.
Daniel Naber wrote:

>On Monday 11 October 2004 10:53, Christoph Goller wrote:
>
>  
>
>>Maybe the default should remain 0 and
>>folks with big indices should decide by themselve to use a prefix.
>>    
>>
>
>I agree that the default should stay 0, even for Lucene 2.0.
>  
>
yeap, if going for a default of 0, we also could delete the bug i opened 
yesterday ( http://issues.apache.org/bugzilla/show_bug.cgi?id=31619 ) 
because everything will work like before.

regards
Bernhard

>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>  
>


Re: FuzzyQuery prefix length

Posted by Daniel Naber <da...@t-online.de>.
On Monday 11 October 2004 10:53, Christoph Goller wrote:

> Maybe the default should remain 0 and
> folks with big indices should decide by themselve to use a prefix.

I agree that the default should stay 0, even for Lucene 2.0.

Regards
 Daniel


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org