Posted to java-user@lucene.apache.org by Sean Timm <ti...@aol.com> on 2007/03/15 19:44:33 UTC

search timeout

Nutch recently added a search query timeout (NUTCH-308).  Are there any 
plans to add such functionality to the Lucene HitCollector directly?  Or 
is there some reason that this is a bad idea?

I'm using Solr which doesn't seem to support search timeouts.  It seems 
that it would make sense to add the feature at the Lucene level rather 
than implement the feature in each derivative.

Thanks,
Sean

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: search timeout

Posted by Erick Erickson <er...@gmail.com>.

On 3/17/07, Chris Hostetter <ho...@fucit.org> wrote:
>
>
> Ack! ... this is what happens when i only skim a patch and then write with
> my odd mix of authority and childlike speling....


I'm telling ya, man, ya gotta get Firefox, use Gmail (or at least a
web-interfaced e-mail client) and turn on the auto spellcheck <G>....

> : * it creates a single (static) timer thread, which counts the "ticks",
> : every couple hundred ms (configurable). It uses a volatile int counter,
> : therefore avoiding the need to synchronize.
> :
> : * each HitCollector records the start tick count in its constructor, and
> : then checks the current tick count in collect(...). If the difference is
>
> So i was way wrong about the Timer per search ... but it seems like this
> approach still has the downside that "long" searches resulting in no
> matches won't time out (because collect will never be called and the tick
> counter will never be compared)
>
> Was this considered a non-issue for Nutch because the query structure is
> typically well known and queries with no results usually return
> immediately? ... in the totally generic case, this isn't a safe
> assumption.  Crazy complex BooleanQueries, or worse still: arbitrary
> client-written Query classes, could spend untold time advancing to the
> "next" match (which may not exist at all)
>
>
>
> -Hoss

Re: search timeout

Posted by Chris Hostetter <ho...@fucit.org>.

: > immediately? ... in the totally generic case, this isn't a safe

: This was implemented as an easy way to control the maximum search time
: for typical queries. I'm open for suggestions how to improve it. One

The only thing i can think of that would truly time out *any* query is a
separate Timer for each search.

: thing that sticks out like a sore thumb is the use of exceptions to break
: the loop - IMHO the collect() method should simply return a boolean or
: int code that tells other parts of Lucene to stop collecting hits.

HitCollector is an abstract class, so it would be pretty easy to add another
version of collect that doesn't return void, and add default impls for each
that call the other one (the way Analyzer.tokenStream used to work)
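
A minimal sketch of that idea in self-contained Java. StoppableCollector,
collectOrStop and drive are hypothetical names, not Lucene API. Unlike the
"each calls the other" arrangement, only the new method delegates here, which
is an assumption made to rule out mutual recursion when a subclass overrides
neither method.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an abstract collector class; not the Lucene API.
abstract class StoppableCollector {
    // Existing void-returning callback that current subclasses implement.
    abstract void collect(int doc, float score);

    // New variant: returns false to ask the caller to stop collecting.
    // The default delegates to the old method and never asks to stop,
    // so existing subclasses keep working unchanged.
    boolean collectOrStop(int doc, float score) {
        collect(doc, score);
        return true;
    }

    // Hypothetical driver loop: offers hits until the collector declines more.
    // Returns the number of hits offered.
    static int drive(StoppableCollector c, int[] docs) {
        int offered = 0;
        for (int doc : docs) {
            offered++;
            if (!c.collectOrStop(doc, 1.0f)) {
                break;
            }
        }
        return offered;
    }
}

// Collector that stops after a fixed number of hits, without exceptions.
final class FirstNCollector extends StoppableCollector {
    final List<Integer> hits = new ArrayList<>();
    private final int limit;

    FirstNCollector(int limit) {
        this.limit = limit;
    }

    @Override
    void collect(int doc, float score) {
        hits.add(doc);
    }

    @Override
    boolean collectOrStop(int doc, float score) {
        collect(doc, score);
        return hits.size() < limit; // false once the limit is reached
    }
}
```

A collector that overrides only collect() is driven to the end of the hit
list, while FirstNCollector ends the loop early through its return value.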




-Hoss



Re: search timeout

Posted by Andrzej Bialecki <ab...@getopt.org>.

Chris Hostetter wrote:
> Ack! ... this is what happens when i only skim a patch and then write with
> my odd mix of authority and childlike speling....
>
> : * it creates a single (static) timer thread, which counts the "ticks",
> : every couple hundred ms (configurable). It uses a volatile int counter,
> : therefore avoiding the need to synchronize.
> :
> : * each HitCollector records the start tick count in its constructor, and
> : then checks the current tick count in collect(...). If the difference is
>
> So i was way wrong about the Timer per search ... but it seems like this
> approach still has the downside that "long" searches resulting in no
> matches won't time out (because collect will never be called and the tick
> counter will never be compared)
>
> Was this considered a non-issue for Nutch because the query structure is
> typically well known and queries with no results usually return
> immediately? ... in the totally generic case, this isn't a safe
>   

This was implemented as an easy way to control the maximum search time 
for typical queries. I'm open to suggestions on how to improve it. One 
thing that sticks out like a sore thumb is the use of exceptions to break 
the loop - IMHO the collect() method should simply return a boolean or 
int code that tells other parts of Lucene to stop collecting hits.


> assumption.  Crazy complex BooleanQueries, or worse still: arbitrary
> client-written Query classes, could spend untold time advancing to the
> "next" match (which may not exist at all)
>   

Yes, any suggestions are welcome :)

Andrzej


Re: search timeout

Posted by Chris Hostetter <ho...@fucit.org>.

Ack! ... this is what happens when i only skim a patch and then write with
my odd mix of authority and childlike speling....

: * it creates a single (static) timer thread, which counts the "ticks",
: every couple hundred ms (configurable). It uses a volatile int counter,
: therefore avoiding the need to synchronize.
:
: * each HitCollector records the start tick count in its constructor, and
: then checks the current tick count in collect(...). If the difference is

So i was way wrong about the Timer per search ... but it seems like this
approach still has the downside that "long" searches resulting in no
matches won't time out (because collect will never be called and the tick
counter will never be compared)

Was this considered a non-issue for Nutch because the query structure is
typically well known and queries with no results usually return
immediately? ... in the totally generic case, this isn't a safe
assumption.  Crazy complex BooleanQueries, or worse still: arbitrary
client-written Query classes, could spend untold time advancing to the
"next" match (which may not exist at all)
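
The gap described above can be reproduced with a toy model. Everything here
is hypothetical (ToySearch and CountingDeadlineCollector are illustrative
names, not Lucene classes): the "search" tests a predicate per document and
calls collect only on a match, and the collector counts how often it gets a
chance to check a deadline.

```java
import java.util.function.IntPredicate;

// Toy collector: the only place a deadline could be checked is collect().
final class CountingDeadlineCollector {
    int deadlineChecks; // how many times collect() had a chance to look at the clock

    void collect(int doc) {
        deadlineChecks++; // a real collector would compare timestamps or ticks here
    }
}

// Toy search loop, standing in for a scorer advancing to its next match.
final class ToySearch {
    private ToySearch() {}

    // Returns the number of documents examined; collect() runs only on matches.
    static int search(int maxDoc, IntPredicate matches, CountingDeadlineCollector c) {
        int examined = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            examined++; // in a real query this step may be arbitrarily expensive
            if (matches.test(doc)) {
                c.collect(doc);
            }
        }
        return examined;
    }
}
```

With a predicate that never matches, every document is examined but
deadlineChecks stays at zero, so no collect-based time limit can fire.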



-Hoss



Re: search timeout

Posted by Andrzej Bialecki <ab...@getopt.org>.

markharw00d wrote:
> Chris Hostetter wrote:
>> this is something anyone using the Lucene API can do as long as they use a
>> HitCollector ... the Nutch impl seems to actually spin up a separate thread
>>
>
> I'm keen to understand the pros and cons of these two approaches.
>
> With the HitCollector approach is this just engineering a fall at the 
> final hurdle? It could be that long-running queries spend all their 
> time doing edit-distance comparisons for a fuzzy boolean query, 
> say, or reading TermDocs for a large range filter to create a BitSet, 
> only to be aborted at the collection stage?
>
> Another point - I noticed in some basic timing tests that calling 
> System.currentTimeMillis() in a tight loop, e.g. for *every* call to 
> HitCollector.collect(..), could add noticeable overhead, so you probably 
> only want to call it for every nth document collected when testing 
> execution times.

That's why the Nutch implementation doesn't do this (I know, I wrote it ;) ).

What it does is the following (please see the patch for details):

* it creates a single (static) timer thread, which counts the "ticks", 
every couple hundred ms (configurable). It uses a volatile int counter, 
therefore avoiding the need to synchronize.

* each HitCollector records the start tick count in its constructor, and 
then checks the current tick count in collect(...). If the difference is 
too large then it throws a RuntimeException (NOTE: would someone 
*please* refactor this API so that we can exit this loop more gracefully!).

This design has several benefits: it avoids creating too many timer 
threads (there is just one per JVM), it avoids the need to synchronize 
on the value being changed, and it avoids calling 
System.currentTimeMillis().
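
The design described above might be sketched roughly as follows. This is not 
the code from the NUTCH-308 patch: TickClock, TickLimitedCollector, the 
100 ms resolution and the injectable IntSupplier clock are all assumptions 
made for illustration (the injected clock is only there so the collector can 
be exercised without waiting on a real timer).

```java
import java.util.function.IntSupplier;

// One timer thread per JVM that bumps a counter at a fixed resolution.
final class TickClock {
    private static final long RESOLUTION_MS = 100; // assumed value; the original is configurable
    private static volatile int ticks;             // volatile: readers need no synchronization

    static {
        Thread timer = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(RESOLUTION_MS);
                } catch (InterruptedException e) {
                    return; // stop ticking if interrupted
                }
                ticks++; // safe without locking only because this thread is the sole writer
            }
        }, "tick-clock");
        timer.setDaemon(true); // do not keep the JVM alive for the clock
        timer.start();
    }

    private TickClock() {}

    static int now() {
        return ticks;
    }
}

// Collector that gives up once too many ticks have passed since construction.
final class TickLimitedCollector {
    private final IntSupplier clock;
    private final int startTick;
    private final int maxTicks;
    private int collected;

    TickLimitedCollector(IntSupplier clock, int maxTicks) {
        this.clock = clock;
        this.maxTicks = maxTicks;
        this.startTick = clock.getAsInt(); // record the start tick in the constructor
    }

    void collect(int doc, float score) {
        // With TickClock::now as the clock this is a single volatile read per hit,
        // and System.currentTimeMillis() is never called.
        if (clock.getAsInt() - startTick > maxTicks) {
            throw new RuntimeException("search timed out after " + collected + " hits");
        }
        collected++;
    }

    int collected() {
        return collected;
    }
}
```

In this sketch, new TickLimitedCollector(TickClock::now, 5) would allow 
roughly half a second, accurate only to the tick resolution.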

Best regards,
Andrzej


Re: search timeout

Posted by karl wettin <ka...@gmail.com>.

17 mar 2007 kl. 10.07 skrev markharw00d:

> Chris Hostetter wrote:
>> this is something anyone using the Lucene API can do as long as they use a
>> HitCollector ... the Nutch impl seems to actually spin up a separate thread
>>
>
> I'm keen to understand the pros and cons of these two approaches.
>
> With the HitCollector approach is this just engineering a fall at
> the final hurdle? It could be that long-running queries spend all
> their time doing edit-distance comparisons for a fuzzy boolean
> query, say, or reading TermDocs for a large range filter to create
> a BitSet, only to be aborted at the collection stage?
>
> Another point - I noticed in some basic timing tests that calling
> System.currentTimeMillis() in a tight loop, e.g. for *every* call to
> HitCollector.collect(..), could add noticeable overhead, so you
> probably only want to call it for every nth document collected
> when testing execution times.

I'd be on the lookout for complex queries that yield no or very few
results. If the timeout check is not running in its own thread, it might
not be triggered in reasonable time.

My guess is that most environments running on a J2SE JVM have no
problem with twice as many threads. Provided there is no extensive use
of stack memory (serializing huge object graphs,
introspection/reflection, etc.), one should be able to optimize memory
usage with -Xss.
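
Running each search in its own thread, as suggested above, could look
something like this sketch built on java.util.concurrent. SearchTimeouts
and runWithTimeout are hypothetical names, and the Callable stands in for
whatever actually performs the search. It assumes the search returns a
non-null result and that the search code checks for interruption;
Future.cancel(true) only requests a stop.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class SearchTimeouts {
    // Daemon worker threads, so an abandoned search cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "search-worker");
        t.setDaemon(true);
        return t;
    });

    private SearchTimeouts() {}

    // Runs the task on a worker thread; returns empty if it exceeds timeoutMs.
    // Assumes the task never returns null.
    static <T> Optional<T> runWithTimeout(Callable<T> search, long timeoutMs) {
        Future<T> future = POOL.submit(search);
        try {
            return Optional.of(future.get(timeoutMs, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; the task must notice and stop
            return Optional.empty();
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }
}
```

The caller gets control back when the timeout expires even if the search
has produced no hits at all, at the cost of one extra thread per search in
flight.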

-- 
karl


Re: search timeout

Posted by Chris Hostetter <ho...@fucit.org>.

: > this is something anyone using the Lucene API can do as long as they use a
: > HitCollector ... the Nutch impl seems to actually spin up a separate thread
: >
:
: I'm keen to understand the pros and cons of these two approaches.

to clarify, it's really just one approach, with an extension: Nutch is
still "timing out" the underlying Lucene calls by using a HitCollector --
it's just extending the basic concept to use a Timer thread to abort at
the timeout without waiting for a collect call to notice that too much
time has elapsed.

This Timer version Nutch is using certainly seems like a more robust and
elegant approach to me ... i'm just shy about spinning up Threads ... i
think it stems from making lots of disastrous multithreading mistakes in
C code when i was a foolish young engineer.



-Hoss



Re: search timeout

Posted by markharw00d <ma...@yahoo.co.uk>.

Chris Hostetter wrote:
> this is something anyone using the Lucene API can do as long as they use a
> HitCollector ... the Nutch impl seems to actually spin up a separate thread
>   

I'm keen to understand the pros and cons of these two approaches.

With the HitCollector approach is this just engineering a fall at the 
final hurdle? It could be that long-running queries spend all their time 
doing edit-distance comparisons for a fuzzy boolean query, say, or 
reading TermDocs for a large range filter to create a BitSet, only to be 
aborted at the collection stage?

Another point - I noticed in some basic timing tests that calling 
System.currentTimeMillis() in a tight loop, e.g. for *every* call to 
HitCollector.collect(..), could add noticeable overhead, so you probably 
only want to call it for every nth document collected when testing 
execution times.
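
A sketch of the "every nth document" idea, assuming a hypothetical 
SampledTimeLimitCollector with an injectable LongSupplier clock (pass 
System::currentTimeMillis in real use). The check interval is an arbitrary 
illustrative parameter, not a measured recommendation.

```java
import java.util.function.LongSupplier;

// Hypothetical collector that consults the clock only on every n-th hit.
final class SampledTimeLimitCollector {
    private final LongSupplier clock;
    private final long deadlineMs;
    private final int checkEvery;
    private int collected;

    SampledTimeLimitCollector(LongSupplier clock, long budgetMs, int checkEvery) {
        if (checkEvery < 1) {
            throw new IllegalArgumentException("checkEvery must be >= 1");
        }
        this.clock = clock;
        this.checkEvery = checkEvery;
        this.deadlineMs = clock.getAsLong() + budgetMs; // one clock read at start
    }

    void collect(int doc, float score) {
        collected++;
        // Short-circuit: the clock is read only when collected is a multiple of checkEvery.
        if (collected % checkEvery == 0 && clock.getAsLong() > deadlineMs) {
            throw new RuntimeException("time limit exceeded after " + collected + " hits");
        }
    }

    int collected() {
        return collected;
    }
}
```

The trade-off is precision: the limit can be overshot by up to 
checkEvery - 1 collected documents before the next check runs.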

Cheers
Mark



Re: search timeout

Posted by Chris Hostetter <ho...@fucit.org>.

: Nutch recently added a search query timeout (NUTCH-308).  Are there any
: plans to add such functionality to the Lucene HitCollector directly?  Or
: is there some reason that this is a bad idea?

Quickly skimming the patch in that Issue, Nutch seems to have done what
has been discussed previously on this list: using a HitCollector which
throws a RuntimeException if a certain amount of time has elapsed.

this is something anyone using the Lucene API can do as long as they use a
HitCollector ... the Nutch impl seems to actually spin up a separate thread
for each request rather than comparing timestamps after each doc is
collected (an interesting choice that both frightens and excites me)

a HitCollector like this could easily be "promoted" up in the Lucene code
base ... the real question is would we want timeout info like this to be
exposed in some of the simpler Searcher APIs (ie: that return TopDocs or
Hits) and if so how do you signal that it's only a partial result set?
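
One possible shape for that signal, purely as an illustration: a result
object carrying a flag next to the hits gathered before the timeout.
PartialResults and searchWithLimit are hypothetical and not part of any
Lucene API; a plain array stands in for a real search, and the timeout test
is passed in as a BooleanSupplier.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical result holder: the hits gathered so far plus a "partial" marker.
final class PartialResults {
    final List<Integer> docs;
    final boolean partial; // true if collection stopped before all candidates were seen

    PartialResults(List<Integer> docs, boolean partial) {
        this.docs = Collections.unmodifiableList(docs);
        this.partial = partial;
    }

    // Collects candidate docs until they run out or timedOut reports true.
    static PartialResults searchWithLimit(int[] candidates, BooleanSupplier timedOut) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : candidates) {
            if (timedOut.getAsBoolean()) {
                return new PartialResults(hits, true); // keep what was found so far
            }
            hits.add(doc);
        }
        return new PartialResults(hits, false);
    }
}
```

A caller can then decide what to do with a partial set, for example
displaying it with a warning or declining to cache it.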


: I'm using Solr which doesn't seem to support search timeouts.  It seems
: that it would make sense to add the feature at the Lucene level rather
: than implement the feature in each derivative.

even if it were added to the Lucene core, some fairly heavy questions
would have to be answered before incorporating this into Solr: mainly what
to do about Caching ... do you cache the partial results? do you attempt
to continue searching after returning the partial results? the next time a
request comes in and partial results are cached, do you try to pick up
where you left off since you've got a threshold of time available?

because of these issues, Solr may need a custom solution to this situation.


-Hoss

