Posted to java-user@lucene.apache.org by David Causse <no...@laposte.net> on 2016/12/21 12:27:51 UTC

TimeLimitingCollector accuracy

Hi,

This subject has been discussed in the past, but I don't think any
real solution has been implemented yet.

Here is a small test case to illustrate the problem: 
https://github.com/nomoa/lucene-solr/commit/2f025b18899038c8606da64c2cf9f4e1f643607f#diff-65ae49ceb38e45a3fc05115be5e61a2dR387

This test will print:

Time waited on a slow query that matches all docs: 1109
Time waited on a slow query that matches no docs: 137258

The problem is that the time check is "passive": the timeout is only
checked when a document is collected. On large segments, if the query
is slow and matches no documents, the timeout becomes very inaccurate,
which makes it nearly impossible to tune the client timeout against the
collector timeout.
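
To make the effect concrete, here is a minimal plain-Java simulation of a
passive check. It does not use Lucene at all: PassiveTimeoutDemo, scan and
the "one tick per examined doc" clock are hypothetical names and assumptions
made up for illustration, and the early return merely stands in for the
collector throwing its timeout exception.

```java
import java.util.function.IntPredicate;

/** Hypothetical simulation, not Lucene code. */
final class PassiveTimeoutDemo {

    /**
     * Scans maxDoc documents and returns the simulated time spent.
     * Assumption: evaluating the query on one document costs one tick.
     */
    static long scan(int maxDoc, IntPredicate matches, long timeoutTicks) {
        long now = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            now++; // cost of evaluating the (slow) query on this doc
            if (matches.test(doc)) {
                // "collect(doc)": the only place where the clock is checked
                if (now > timeoutTicks) {
                    return now; // stands in for the timeout exception
                }
            }
        }
        return now; // the whole segment was scanned
    }
}
```

With a timeout of 10 ticks, a predicate that matches every document stops
right after the timeout, while a predicate that matches nothing scans all
documents because the check is never reached.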

This happens to me with a query that implements a TwoPhaseIterator
whose approximation can be really bad, not to say completely wrong
(regex search on stored content with an approximation based on
extracted tri-grams).

Another problem I discovered is that if the query is accepted by the
QueryCache, it will eagerly build its bitset, bypassing the Collector.

Reading 
https://www.mail-archive.com/java-dev@lucene.apache.org/msg25694.html I 
see that one suggested solution was to move the timeout check to a lower
level (in the scorers), but it raised some concerns about checking the
timeout too frequently.

But given that some effort has been made to separate sub scorers from
"top-level" scorers (see
https://issues.apache.org/jira/browse/LUCENE-5487), would it make sense
now to make BulkScorers aware of some time constraints?

On my side, as a workaround to prevent catastrophes, I'll probably
continue to implement a circuit breaker in my TwoPhaseIterator#matches
that stops the costly operation either by returning false or by
throwing an exception.
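
A rough sketch of the "return false" variant of such a circuit breaker,
written in plain Java so it stands alone. TimeBoundedMatcher, its injectable
clock and the isPartial() flag are hypothetical names for illustration; in
real code the matches(int) logic would sit inside TwoPhaseIterator#matches(),
and the clock would typically be System::nanoTime.

```java
import java.util.function.IntPredicate;
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the check done in TwoPhaseIterator#matches(). */
final class TimeBoundedMatcher {
    private final IntPredicate costlyMatch; // e.g. a regex run on stored content
    private final LongSupplier clock;       // nanosecond clock, injectable for tests
    private final long deadlineNanos;
    private boolean tripped;

    TimeBoundedMatcher(IntPredicate costlyMatch, LongSupplier clock, long budgetNanos) {
        this.costlyMatch = costlyMatch;
        this.clock = clock;
        this.deadlineNanos = clock.getAsLong() + budgetNanos;
    }

    /** Skips the costly work and returns false once the budget is spent. */
    boolean matches(int doc) {
        // Subtraction keeps the comparison safe if the nano clock wraps around.
        if (tripped || clock.getAsLong() - deadlineNanos >= 0) {
            tripped = true;
            return false;
        }
        return costlyMatch.test(doc);
    }

    /** True if at least one document was rejected because of the deadline. */
    boolean isPartial() {
        return tripped;
    }
}
```

The drawback of this variant is that the caller only learns the results are
partial if it looks at the flag afterwards; nothing in the normal collection
path reports it.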

Lastly, I think it would help me work around this problem if the
constructor of TimeExceededException were public. Are there any reasons
for this constructor to be private? Would it break important workflows
if a scorer started to throw this exception? It would allow me to still
return partial results.
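
To show what I mean by throwing and still keeping partial results, here is
a plain-Java sketch. It deliberately avoids Lucene: BudgetExceededException
is a hypothetical exception of my own (not TimeExceededException), and
PartialResults and PartialSearchDemo are made-up names that stand in for the
collector and the search loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical unchecked exception; NOT Lucene's TimeExceededException. */
final class BudgetExceededException extends RuntimeException {
    BudgetExceededException() {
        // No suppression and no stack trace: it is used for control flow only.
        super("time budget exceeded", null, false, false);
    }
}

/** Hypothetical result holder: hits gathered so far plus a partial flag. */
final class PartialResults {
    final List<Integer> hits = new ArrayList<>();
    boolean partial;
}

final class PartialSearchDemo {
    /** Collects matching docs; keeps what was gathered if the matcher aborts. */
    static PartialResults search(int maxDoc, IntPredicate matches) {
        PartialResults results = new PartialResults();
        try {
            for (int doc = 0; doc < maxDoc; doc++) {
                if (matches.test(doc)) {
                    results.hits.add(doc);
                }
            }
        } catch (BudgetExceededException e) {
            results.partial = true; // hits collected before the abort are kept
        }
        return results;
    }
}
```

The point of the sketch is only that the exception is caught at the top
level, so whatever was collected before it was thrown can still be returned
and flagged as partial.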

Thanks for your help


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: TimeLimitingCollector accuracy

Posted by David Causse <no...@laposte.net>.

On 21/12/2016 at 13:27, David Causse wrote:
> But given that some efforts have been done to separate sub scorers 
> from "top-level" scorers (see 
> https://issues.apache.org/jira/browse/LUCENE-5487) would it make sense 
> now to make BulkScorers aware of some time constraints? 

Looking a bit closer, I don't think this would help resolve the accuracy
problems. Maybe it would help a bit, when a TwoPhaseIterator is
available, to check the timeout before calling twoPhase#matches()...

Anyway, if someone has suggestions on how to write a time-bounded query
that returns partial results (and still informs the Collector
that they are partial), given that the costly operation happens in
twoPhase#matches(), it would be much appreciated.

Currently I have ended up writing an ugly workaround that I'm too
ashamed to share :)

Thanks!
