Posted to java-user@lucene.apache.org by David Causse <no...@laposte.net> on 2016/12/21 12:27:51 UTC
TimeLimitingCollector accuracy
Hi,
This subject has been discussed in the past, but I don't think any
real solution has been implemented yet.
Here is a small test case to illustrate the problem:
https://github.com/nomoa/lucene-solr/commit/2f025b18899038c8606da64c2cf9f4e1f643607f#diff-65ae49ceb38e45a3fc05115be5e61a2dR387
This test will print:
Time waited on a slow query that matches all docs: 1109
Time waited on a slow query that matches no docs: 137258
The problem is that the time check is "passive": the timeout is only
checked when a document is collected. On large segments, if the query
is slow and matches no documents, the timeout is very inaccurate, which
makes it nearly impossible to tune the client timeout against the
collector timeout.
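To make the effect concrete, here is a minimal toy model of a passive
check. It does not use Lucene at all: PassiveTimeoutDemo, its search
method and the "ticks" clock are hypothetical names made up for this
illustration, and the only assumption is that the clock is consulted
when a hit is collected and nowhere else.

```java
import java.util.function.IntPredicate;

/** Toy model (not Lucene code) of a timeout that is only checked on collected hits. */
final class PassiveTimeoutDemo {

    /**
     * Scans maxDoc documents. Evaluating each document costs costPerDoc ticks of
     * simulated time. Returns the simulated time at which the scan stopped.
     */
    static long search(int maxDoc, long costPerDoc, long timeoutTicks, IntPredicate matches) {
        long now = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            now += costPerDoc;                 // the slow part of the query
            if (matches.test(doc)) {
                // "collect": the only place where the clock is looked at
                if (now > timeoutTicks) {
                    break;                     // timeout noticed, stop with partial results
                }
            }
        }
        return now;
    }

    public static void main(String[] args) {
        long all = search(1000, 1, 100, doc -> true);
        long none = search(1000, 1, 100, doc -> false);
        System.out.println("matches all docs: stopped at " + all);
        System.out.println("matches no docs:  stopped at " + none);
    }
}
```

With a timeout of 100 ticks, the scan that matches every document stops
right after the deadline, while the scan that matches nothing runs
through the whole segment because nothing ever triggers the check.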
This happens to me with a query that implements a TwoPhaseIterator
whose approximation can be really bad, not to say completely wrong
(regex search on stored content, with an approximation based on
extracted trigrams).
Another problem I discovered is that if the query is accepted by the
QueryCache, the cache eagerly builds its bitset, bypassing the Collector.
Reading
https://www.mail-archive.com/java-dev@lucene.apache.org/msg25694.html I
see that one suggested solution was to move the timeout check to a lower
level (into the scorers), but this raised some concerns about checking
the timeout too frequently.
But given that some effort has gone into separating sub-scorers from
"top-level" scorers (see
https://issues.apache.org/jira/browse/LUCENE-5487), would it make sense
now to make BulkScorers aware of time constraints?
On my side, as a workaround to prevent catastrophes, I'll probably
continue to implement a circuit breaker in my TwoPhaseIterator#matches
that stops the costly operation, either by returning false or by
throwing an exception.
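A rough sketch of what such a circuit breaker could look like is shown
here. Everything in it is hypothetical: MatchCircuitBreaker,
BudgetExceededException and the injected clock are names invented for
this illustration, not Lucene classes. The idea is that a costly
matches() implementation would call allow() first and skip the
expensive work once the time budget is spent.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper, meant to be consulted at the top of a costly matches(). */
final class MatchCircuitBreaker {

    /** Hypothetical unchecked exception used by the "throwing" mode. */
    static final class BudgetExceededException extends RuntimeException {
        BudgetExceededException() {
            super("time budget exceeded");
        }
    }

    private final LongSupplier clock;      // e.g. System::nanoTime in real use
    private final long deadline;
    private final boolean throwOnTimeout;
    private boolean tripped;

    MatchCircuitBreaker(LongSupplier clock, long budget, boolean throwOnTimeout) {
        this.clock = clock;
        this.deadline = clock.getAsLong() + budget;
        this.throwOnTimeout = throwOnTimeout;
    }

    /**
     * Returns true while the budget is not exhausted. Once it is, either returns
     * false (the caller then reports "no match") or throws, depending on the mode.
     * The breaker stays tripped for the rest of its lifetime.
     */
    boolean allow() {
        if (!tripped && clock.getAsLong() - deadline >= 0) {
            tripped = true;
        }
        if (tripped && throwOnTimeout) {
            throw new BudgetExceededException();
        }
        return !tripped;
    }

    /** Lets the caller find out afterwards that the results are only partial. */
    boolean tripped() {
        return tripped;
    }
}
```

In the "return false" mode the search finishes normally, so the caller
has to read tripped() afterwards to learn that the results are partial;
the throwing mode makes that explicit but needs someone up the stack to
catch the exception.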
Lastly, I think it would help me work around this problem if the
constructor of TimeExceededException were public. Is there any reason
for this constructor to be private? Would it break important workflows
if a scorer started throwing this exception? It would allow me to still
return partial results.
Thanks for your help
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: TimeLimitingCollector accuracy
Posted by David Causse <no...@laposte.net>.
On 21/12/2016 at 13:27, David Causse wrote:
> But given that some efforts have been done to separate sub scorers
> from "top-level" scorers (see
> https://issues.apache.org/jira/browse/LUCENE-5487) would it make sense
> now to make BulkScorers aware of some time constraints?
Looking a bit closer, I don't think this would help resolve the accuracy
problems. Maybe it would help a bit, when a TwoPhaseIterator is
available, to check the timeout before calling twoPhase#matches()...
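The idea of checking the timeout before each call to matches() can be
sketched with a toy driving loop. This is not Lucene's BulkScorer:
DeadlineAwareScan, Result and the injected clock are hypothetical names
for this illustration, and the candidate array stands in for the
approximation. Note that a single slow matches() call can still
overrun the deadline, so this only bounds the error by the cost of one
call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;
import java.util.function.LongSupplier;

/** Toy driving loop (not Lucene code) that checks the deadline before every costly match. */
final class DeadlineAwareScan {

    /** Hits collected so far, plus a flag telling whether the scan was cut short. */
    static final class Result {
        final List<Integer> hits = new ArrayList<>();
        boolean partial;
    }

    static Result scan(int[] candidates, IntPredicate costlyMatches,
                       LongSupplier clock, long deadline) {
        Result result = new Result();
        for (int doc : candidates) {
            // Checked for every candidate, whether or not it ends up matching.
            if (clock.getAsLong() >= deadline) {
                result.partial = true;
                break;
            }
            if (costlyMatches.test(doc)) {   // stands in for twoPhase#matches()
                result.hits.add(doc);
            }
        }
        return result;
    }
}
```

Because the check no longer depends on a hit being collected, a query
whose approximation is wrong for every candidate is still stopped close
to the deadline, and the partial flag tells the caller that the hit
list is incomplete.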
Anyway, if someone has suggestions on how to write a time-bounded query
that returns partial results (and still informs the Collector that they
are partial), given that the costly operation happens in
twoPhase#matches(), they would be much appreciated.
For now I have ended up writing an ugly workaround that I'm too ashamed
to share :)
Thanks!