Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2004/02/12 22:45:53 UTC

SubstringQuery -- Re: Leading Wild Card Search

Kristian Hermsdorf wrote:

> Hi
>
>> I've written a PrefixQuery and it's not hard to do - I can post it too.
>> Problem is that it is not integrated into the query parser (.jj) so odds
>> are no one will use it, and the general sentiment on this list (and
>> lucene-dev) is that prefix queries are evil because it's an expensive
>> operation, as the query code has to traverse all terms to "expand" the
>> query. I would prefer a more user oriented view, i.e. just allow it, as
>> sometimes it's what you need and the only alternative I can think of,
>> doing a fuzzy query, isn't quite right.
>
>
> wow - great!
> I've been looking for sample code for quite a while. I'd like to test
> the performance on our data to see if it's really that slow.

2 files attached: SubstringQuery (which you'll use) and
SubstringTermEnum (used by the former to be consistent w/ other Query
code).

I find this kind of query useful to have and think that the query parser
should allow it in spite of the perception of this being slow. However, I
think the debate is the "user centric view" (say mine: allow substring
queries) vs the "protect the engine's performance" view, which says not to
allow expensive queries.
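
To make the cost difference concrete, here is a minimal sketch, assuming the term dictionary can be modelled as a sorted set of strings. The class and method names are hypothetical and are not the attached SubstringQuery or any Lucene API. A prefix expansion can seek to its first candidate and stop at the first miss, while a substring expansion has to inspect every term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical stand-in for an index's sorted term dictionary; not Lucene API.
class TermExpansionSketch {

    // Prefix expansion: seek to the first term >= prefix, stop at the first miss.
    static List<String> expandPrefix(SortedSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order: no later term can start with the prefix
            }
            matches.add(term);
        }
        return matches;
    }

    // Substring expansion: no shortcut, every term in the dictionary is visited.
    static List<String> expandSubstring(SortedSet<String> terms, String fragment) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.contains(fragment)) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

The work done by expandSubstring grows with the number of distinct terms in the index, regardless of how few terms actually match.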

>
> It would be great if you could post a URL where your extension can be found.
>
> thank you
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>


number of fields, size of fields

Posted by Anson Lau <al...@fulfil-net.com>.
Hi All,

I'm a beginner with Lucene.  I would like to know if there are general
guidelines on:

1. the number of fields a document can have
2. size of unindexed fields
3. size of a stored text field

I just want to get a feel for what are the good practices.

Thanks,

Anson Lau




Re: SubstringQuery -- Re: Leading Wild Card Search

Posted by Terry Steichen <te...@net-frame.com>.
Doug,

What you say makes a good deal of sense to me.  Could you give us a relative
sense of the "slowness" of different operators?

Regards

Terry

----- Original Message -----
From: "Doug Cutting" <cu...@apache.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, February 17, 2004 1:16 PM
Subject: Re: SubstringQuery -- Re: Leading Wild Card Search


> David Spencer wrote:
> > 2 files attached, SubstringQuery (which you'll use) and
> > SubstringTermEnum ( used by the former to be
> > consistent w/ other Query code).
> >
> > I find this kind of query useful to have and think that the query parser
> > should allow it in spite of the perception
> > of this being slow, however I think the debate is the "user centric
> > view" (say mine, allow substring queries)
> > vs the "protect the engine's performance" view which says not to allow
> > expensive queries.
>
> I think the argument is more complex.
>
> One issue is cost of execution: very slow queries can be used to
> implement a denial-of-service attack.  Maybe that's an overstatement,
> but in a web server setting, once a few slow searches are running, no
> others may complete.  When folks hit "Stop" in their browser the server
> does not stop processing the query.  If they hit "Reload" then another
> new search is started.  So these can be very problematic.  This is real.
>   Lots of folks have deployed Lucene with large indexes and then found
> that their server randomly crashes.  Closer scrutiny shows that they
> were permitting operators that are too slow for their combination of
> index size and query traffic.  The BooleanQuery.TooManyClauses exception
> was added to address this, but it can still be too late, if the problem
> is caused before the query is built, e.g., while enumerating all terms.
>
> A related issue is that users (and even most developers) don't
> understand the relative costs of different query operators.  Some things
> are fast, others are surprisingly slow.  That's not a great user
> experience, and triggers problems like those described above.  People
> think that the rare slow cases are network problems or something, and
> hit "Reload".
>
> I have no problem with including slow operators with Lucene, but they
> should be well documented as such, at least for developers.  Perhaps we
> should make a pass through the existing Query classes, in particular
> those which expand into other queries, and add some performance notes,
> so that folks don't blindly start using things which may bite them.  By
> default I think it would be safest if the QueryParser only permitted
> operators which are efficient.  Folks can then, at their own risk,
> enable other operators.
>
> In summary, removing operators can be user-centric, if it removes
> unpredictability.  And the reason for protecting engine performance is
> not miserly, it's to guarantee availability.  And finally, an issue
> dear to me, a predictable search engine results in fewer spurious bug
> reports, saving developer time for real bugs.
>
> Doug




Re: SubstringQuery -- Re: Leading Wild Card Search

Posted by David Spencer <da...@tropo.com>.
Doug Cutting wrote:

> David Spencer wrote:
>
>> 2 files attached, SubstringQuery (which you'll use) and 
>> SubstringTermEnum ( used by the former to be
>> consistent w/ other Query code).
>>
>> I find this kind of query useful to have and think that the query 
>> parser should allow it in spite of the perception
>> of this being slow, however I think the debate is the "user centric 
>> view" (say mine, allow substring queries)
>> vs the "protect the engine's performance" view which says not to allow 
>> expensive queries.
>
>
> I think the argument is more complex.

Thanks for the elaboration; a few notes below.

>
> One issue is cost of execution: very slow queries can be used to 
> implement a denial-of-service attack.  Maybe that's an overstatement, 
> but in a web server setting, once a few slow searches are running, 
> no others may complete.  When folks hit "Stop" in their browser the 
> server does not stop processing the query.  If they hit "Reload" then 
> another new search is started.  So these can be very problematic.

I guess this is defensible, right? The search engine could allow only
one query at a time per session.
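
A minimal sketch of that idea, assuming the web layer supplies some session identifier. PerSessionSearchGuard and runIfIdle are hypothetical names, not part of Lucene or any servlet API. A second search from a session that already has one in flight is rejected instead of being started.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical guard; the session id would come from the web framework in use.
class PerSessionSearchGuard {
    private final Set<String> busySessions = ConcurrentHashMap.newKeySet();

    // Runs the search only if this session has no search in flight.
    // Returns null when the request is rejected.
    <T> T runIfIdle(String sessionId, Supplier<T> search) {
        if (!busySessions.add(sessionId)) {
            return null; // a previous search for this session is still running
        }
        try {
            return search.get();
        } finally {
            busySessions.remove(sessionId); // free the slot even if the search fails
        }
    }
}
```

This only dampens repeated "Reload" clicks from one browser session; a client that opens a fresh session for every request is not constrained by it.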

> This is real.  Lots of folks have deployed Lucene with large indexes 
> and then found that their server randomly crashes.  Closer scrutiny 
> shows that they were permitting operators that are too slow for their 
> combination of index size and query traffic.  The 
> BooleanQuery.TooManyClauses exception was added to address this, but 
> it can still be too late, if the problem is caused before the query is 
> built, e.g., while enumerating all terms.

Yeah, this is tricky, as TooManyClauses is not necessarily the cure, and
every query can't defend itself against expense, as these heuristics
(e.g. max # of clauses) are not always right (they depend on speed of
system/disk, size of index, etc).
It's too bad the Java VM doesn't have any kind of execution time quotas
for threads or something like that, as that would seem, in an ideal world,
to get to the root of the issue. Going back in time I believe the
Telescript VM from General Magic...
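
Short of a VM-level quota, a rough approximation is possible with the standard java.util.concurrent classes: run the search on a worker thread and stop waiting once a time budget is spent. The sketch below is only that. TimeBoxedSearch and runWithBudget are hypothetical names, and the caveat in the comment matters: cancelling only interrupts the worker, so search code that never checks for interruption keeps running.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimeBoxedSearch {
    // Returns the task's result, or null if it did not finish within the budget.
    static <T> T runWithBudget(ExecutorService pool, Callable<T> task, long budgetMillis)
            throws InterruptedException, ExecutionException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(budgetMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker thread; the task must
            // check Thread.interrupted() or block interruptibly to really stop.
            future.cancel(true);
            return null;
        }
    }
}
```

The caller gets its thread back on time, but the CPU and disk cost of an uncooperative search is not reclaimed, which is why this is weaker than a true execution quota.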

>
> A related issue is that users (and even most developers) don't 
> understand the relative costs of different query operators.

Sure.
I'm coming at this from the point of view of the smaller personal/small 
enterprise search server, where
you "know" you need a certain kind of possibly expensive query and don't 
want your hands tied.

> Some things are fast, others are surprisingly slow.  That's not a 
> great user experience, and triggers problems like those described above.

Reminds me of a human factors paper I saw in some ACM pub years ago (um,
prob the 80's). It said more or less what you're saying: that humans prefer
a constant response time even if a varying response time has a lower
average. An example is, say, a compiler: if it always takes 10 sec to
compile a file the users are happy, but if it takes 1 sec 99% of the time
and a minute the other 1% then users are less happy. What was most
interesting about the paper was that they suggested something like this:

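goal = 10; // target response time, in seconds
t1 = time(); // start time, in seconds
results = search(); // execute search, don't display results yet
elapsed = time() - t1; // seconds spent searching
if (elapsed < goal)
   sleep(goal - elapsed); // the shocker: pad fast searches up to the goal
display_results(results);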

i.e. artificially delay(!) if less than some threshold, thus making the
user happier (!) to get a more constant response time.



> People think that the rare slow cases are network problems or 
> something, and hit "Reload".
>
> I have no problem with including slow operators with Lucene, but they 
> should be well documented as such, at least for developers.  Perhaps 
> we should make a pass through the existing Query classes, in 
> particular those which expand into other queries, and add some 
> performance notes, so that folks don't blindly start using things 
> which may bite them.  By default I think it would be safest if the 
> QueryParser only permitted operators which are efficient.  Folks can 
> then, at their own risk, enable other operators.
>
> In summary, removing operators can be user-centric, if it removes 
> unpredictability.  And the reason for protecting engine performance is 
> not miserly, it's to guarantee availability.  And finally, an issue 
> dear to me, a predictable search engine results in fewer spurious bug 
> reports, saving developer time for real bugs.

Great, thx..
 -Dave

>
> Doug




Re: SubstringQuery -- Re: Leading Wild Card Search

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> 2 files attached, SubstringQuery (which you'll use) and 
> SubstringTermEnum ( used by the former to be
> consistent w/ other Query code).
> 
> I find this kind of query useful to have and think that the query parser 
> should allow it in spite of the perception
> of this being slow, however I think the debate is the "user centric 
> view" (say mine, allow substring queries)
> vs the "protect the engine's performance" view which says not to allow 
> expensive queries.

I think the argument is more complex.

One issue is cost of execution: very slow queries can be used to
implement a denial-of-service attack.  Maybe that's an overstatement,
but in a web server setting, once a few slow searches are running, no
others may complete.  When folks hit "Stop" in their browser the server
does not stop processing the query.  If they hit "Reload" then another
new search is started.  So these can be very problematic.  This is real.
Lots of folks have deployed Lucene with large indexes and then found
that their server randomly crashes.  Closer scrutiny shows that they
were permitting operators that are too slow for their combination of
index size and query traffic.  The BooleanQuery.TooManyClauses exception
was added to address this, but it can still be too late, if the problem
is caused before the query is built, e.g., while enumerating all terms.
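
One way to picture the "too late" problem: a cap on the number of clauses only trips once enough matching terms have been collected, while the enumeration itself may already have walked a huge term dictionary. The sketch below is illustrative only. BoundedExpansion and ExpansionTooCostly are hypothetical names, not Lucene's BooleanQuery.TooManyClauses. It bounds the number of terms visited as well as the number of clauses produced, so the expansion fails fast in either case.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class BoundedExpansion {
    // Hypothetical exception; it is not Lucene's BooleanQuery.TooManyClauses.
    static class ExpansionTooCostly extends RuntimeException {
        ExpansionTooCostly(String message) {
            super(message);
        }
    }

    // Expands a substring into matching terms, giving up early if either
    // the number of terms visited or the number of clauses exceeds its cap.
    static List<String> expand(Iterator<String> terms, String fragment,
                               int maxTermsVisited, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        int visited = 0;
        while (terms.hasNext()) {
            if (++visited > maxTermsVisited) {
                throw new ExpansionTooCostly("visited more than " + maxTermsVisited + " terms");
            }
            String term = terms.next();
            if (term.contains(fragment)) {
                if (clauses.size() == maxClauses) {
                    throw new ExpansionTooCostly("more than " + maxClauses + " clauses");
                }
                clauses.add(term);
            }
        }
        return clauses;
    }
}
```

Both caps are still heuristics: a sensible value depends on index size, hardware, and query traffic.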

A related issue is that users (and even most developers) don't 
understand the relative costs of different query operators.  Some things 
are fast, others are surprisingly slow.  That's not a great user 
experience, and triggers problems like those described above.  People 
think that the rare slow cases are network problems or something, and 
hit "Reload".

I have no problem with including slow operators with Lucene, but they 
should be well documented as such, at least for developers.  Perhaps we 
should make a pass through the existing Query classes, in particular 
those which expand into other queries, and add some performance notes, 
so that folks don't blindly start using things which may bite them.  By 
default I think it would be safest if the QueryParser only permitted 
operators which are efficient.  Folks can then, at their own risk, 
enable other operators.
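
As a rough illustration of the opt-in idea, and not a description of how the QueryParser is actually structured, a parser could consult a small policy object before accepting an operator. OperatorPolicy, its Operator enum, and the choice of which operators are on by default are all assumptions made up for this sketch.

```java
import java.util.EnumSet;
import java.util.Set;

class OperatorPolicy {
    // Illustrative operator kinds; not an inventory of real QueryParser syntax.
    enum Operator { TERM, PHRASE, LEADING_WILDCARD, SUBSTRING }

    // Assumed default: only operators treated as cheap in this sketch are on.
    private final Set<Operator> enabled = EnumSet.of(Operator.TERM, Operator.PHRASE);

    // Explicit opt-in, made by the developer at their own risk.
    OperatorPolicy enable(Operator op) {
        enabled.add(op);
        return this;
    }

    // Called by the (hypothetical) parser before it builds a query for op.
    void check(Operator op) {
        if (!enabled.contains(op)) {
            throw new IllegalArgumentException(
                    op + " is disabled by default; enable it explicitly to accept the cost");
        }
    }
}
```

A small personal search server could call enable(Operator.SUBSTRING) once at startup, while a public deployment keeps the defaults.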

In summary, removing operators can be user-centric, if it removes
unpredictability.  And the reason for protecting engine performance is
not miserly, it's to guarantee availability.  And finally, an issue
dear to me, a predictable search engine results in fewer spurious bug
reports, saving developer time for real bugs.

Doug
