You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by Jason Rutherglen <ja...@gmail.com> on 2009/06/23 23:26:04 UTC

Improving TimeLimitedCollector

As we're revamping collectors, weights, and scorers, perhaps we
can push time limiting into the individual subscorers? Currently
on a boolean query, we're timing out the query at the top level
which doesn't work well if the subqueries exceed the time limit.

Re: Improving TimeLimitedCollector

Posted by Mark Harwood <ma...@yahoo.co.uk>.

Odd. I see you're responding to a message from Shai I didn't get. Some  
mail being dropped somewhere along the line..

> Why don't you use Thread.interrupt(), .isInterrupted() ?


Not sure where exactly you mean for that?

>>
>> I'm not sure I understand that - how can a thread run >1 activity
>> simultaneously anyway, and how's your impl in  
>> TimeLimitingIndexReader allows
>> it to do so?


I can have a single thread performing using multiple readers/writers  
and each reader/writer has different timed activities e.g.


reader1.startActivity(60000)
try
      do reader 1 stuff....
      reader2.startActivity(10000);
      try
          do reader 2 stuff.....

      finally
          reader2.endActivity();


finally
    reader1.endActivity();

Each reader has it's own list of threads and their anticipated  
timeouts in relation to that particular resource.


Bit of an edge case perhaps but anything based on statics precludes  
this type of code. An instance of a timeout object would give this  
flexibility but as you say - you would need to pass a timeoutable- 
activity context object around which is a bit more hassle. Flexibility  
vs simplicity decision to be made here.

>>
>> I don't mind working on that w/ you, if you want.


Appreciated. I'll move the timeout management logic out of reader into  
another class and open a Jira issue so we can move the discussion/ 
development there.

Cheers
Mark

Re: Improving TimeLimitedCollector

Posted by Earwin Burrfoot <ea...@gmail.com>.

Why don't you use Thread.interrupt(), .isInterrupted() ?

On Sat, Jun 27, 2009 at 16:16, Shai Erera<se...@gmail.com> wrote:
>> A downside of breaking it out into static methods like this is that a
>> thread cannot run >1 time-limited activity simultaneously but I guess that
>> might be a reasonable restriction.
>
> I'm not sure I understand that - how can a thread run >1 activity
> simultaneously anyway, and how's your impl in TimeLimitingIndexReader allows
> it to do so? You use the thread as a key to the map. Am I missing something?
>
> Anyway, I think we can let go of the static methods and make them instance
> methods. I think .. if I want to use time limited activities, I should
> create a TimeLimitedThreadActivity instance and pass it around, to
> TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter)
> and any other custom code I have which I want to put a time limit on.
>
> A static class has the advantage of not needing to pass it around
> everywhere, and is accessible from everywhere, so that if we discover that
> limiting on IndexReader is not enough, and we want some of the scorers to
> check more frequently if they should stop, we won't need to pass that
> instance all the way down to them.
>
> I don't mind keeping it static, but I also don't mind if it will be an
> instance passed around, since currently it's only passed to IndexReader.
>
> Are you going to open an issue for that? Seems like a nice addition to me.
> Do you think it should belong in core or contrib? If 'core' then if that's
> possible I'd like to see all timeout classes under one package, including
> TimeLimitingCollector (which until 2.9 we can safely move around as we
> want).
> I don't mind working on that w/ you, if you want.
>
> Shai
>
> On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood <ma...@yahoo.co.uk>
> wrote:
>>
>> Thanks for the feedback, Shai.
>> So I guess you're suggesting breaking this out into a general utility
>> class e.g. something like:
>> class TimeLimitedThreadActivity
>> {
>>         //called by client
>>         public static void startTimeLimitedActivity(long
>> maxTimePermitted).
>>         public static void endTimeLimitedActivity()
>>        //called by resources (reader/writers) that need to be shared
>> fairly by threads
>>       public static void checkActivityNotElapsed(); //throws some form of
>> runtime exception
>> }
>> A downside of breaking it out into static methods like this is that a
>> thread cannot run >1 time-limited activity simultaneously but I guess that
>> might be a reasonable restriction.
>>
>> >>Aside, how about using a PQ for the threads' times, or a TreeMap. That
>> >> will save looping over the collection to find the next candidate. Just an
>> >> implementation detail though.
>> Yep, that was one of the rough edges - I just wanted to get raw timings
>> first for the all the "is timed out?" checks we're injecting into reader
>> calls.
>> Cheers
>> Mark
>>
>> On 27 Jun 2009, at 11:37, Shai Erera wrote:
>>
>> I like the overall approach. However it's very local to an IndexReader.
>> I.e., if someone wanted to limit other operations (say indexing), or does
>> not use an IndexReader (for a Scorer impl maybe), one cannot reuse it.
>>
>> What if we factor out the timeout logic to a Timeout class (I think it can
>> be a static class, with the way you implemented it) and use it in
>> TimeLimitingIndexReader? That class can offer a method check() which will do
>> the internal logic (the 'if' check and throw exception). It will be similar
>> to the current ensureOpen() followed by an operation.
>>
>> It might be considered more expensive since it won't check a boolean, but
>> instead call a check() method, but it will be more reusable. Also,
>> ensureOpen today is also a method call, so I don't think Timeout.check() is
>> that bad. We can even later create a TimeLimitingIndexWriter and document
>> Timeout class for other usage by external code.
>>
>> Aside, how about using a PQ for the threads' times, or a TreeMap. That
>> will save looping over the collection to find the next candidate. Just an
>> implementation detail though.
>>
>> Shai
>>
>> On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood <ma...@yahoo.co.uk>
>> wrote:
>>>
>>> Going back to my post re TimeLimitedIndexReaders - here's an incomplete
>>> but functional prototype:
>>> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
>>> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>>>
>>> The principle is that all reader accesses check a volatile variable
>>> indicating something may have timed out (no need to check thread locals
>>> etc.) If and only if a time out has been noted threadlocals are checked to
>>> see which thread should throw a timeout exception.
>>> All time-limited use of reader must be wrapped in try...finally calls to
>>> indicate the start and stop of a timed set of activities. A background
>>> thread maintains the next anticipated timeout deadline and simply waits
>>> until this is reached or the list of planned activities changes with new
>>> deadlines.
>>>
>>> Performance seems reasonable on my Wikipedia index:
>>> //some tests for heavy use of termenum/term docs
>>> Read term docs for 200000 terms  in 4755 ms using no timeout limit (warm
>>> up)
>>> Read term docs for 200000 terms  in 4320 ms using no timeout limit (warm
>>> up)
>>> Read term docs for 200000 terms  in 4320 ms using no timeout limit
>>> Read term docs for 200000 terms  in 4388 ms using  reader with
>>> time-limited access
>>> //Example query with heavy use of termEnum/termDocs
>>> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
>>> +text:f* +text:a* +text:b* time limited collector matched 1090041 docs in
>>> 1963 ms
>>> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs in
>>> 2121 ms
>>> //Example fuzzy match burning CPU reading TermEnum
>>> text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
>>> text:accomodation~0.5 time limited collector matched 192084 docs in 5923
>>> ms
>>> text:accomodation~0.5 time limited reader matched 192084 docs in 5945 ms
>>>
>>> The reader approach to limiting time is slower but has these advantages :
>>> 1) Multiple reader activities can be time-limited rather than just single
>>> searches
>>> 2) No code changes required to scorers/queries/filters etc
>>> 3) Tasks that spend plenty of  time burning CPU before collection happens
>>> can be killed earlier
>>> I'm sure there's some thread safety issues to work through in my code and
>>> not all reader classes are wrapped (e.g. TermPositions) but the basics are
>>> there and seem to be functioning
>>> Thoughts?
>>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Improving TimeLimitedCollector

Posted by Shai Erera <se...@gmail.com>.

>
> A downside of breaking it out into static methods like this is that a
> thread cannot run >1 time-limited activity simultaneously but I guess that
> might be a reasonable restriction.
>

I'm not sure I understand that - how can a thread run >1 activity
simultaneously anyway, and how's your impl in TimeLimitingIndexReader allows
it to do so? You use the thread as a key to the map. Am I missing something?

Anyway, I think we can let go of the static methods and make them instance
methods. I think .. if I want to use time limited activities, I should
create a TimeLimitedThreadActivity instance and pass it around, to
TimeLimitingIndexReader (and maybe in the future to a similar **IndexWriter)
and any other custom code I have which I want to put a time limit on.

A static class has the advantage of not needing to pass it around
everywhere, and is accessible from everywhere, so that if we discover that
limiting on IndexReader is not enough, and we want some of the scorers to
check more frequently if they should stop, we won't need to pass that
instance all the way down to them.

I don't mind keeping it static, but I also don't mind if it will be an
instance passed around, since currently it's only passed to IndexReader.

Are you going to open an issue for that? Seems like a nice addition to me.
Do you think it should belong in core or contrib? If 'core' then if that's
possible I'd like to see all timeout classes under one package, including
TimeLimitingCollector (which until 2.9 we can safely move around as we
want).
I don't mind working on that w/ you, if you want.

Shai

On Sat, Jun 27, 2009 at 2:24 PM, Mark Harwood <ma...@yahoo.co.uk>wrote:

> Thanks for the feedback, Shai.
> So I guess you're suggesting breaking this out into a general utility class
> e.g. something like:
>
> class TimeLimitedThreadActivity
> {
>         //called by client
>         public static void startTimeLimitedActivity(long maxTimePermitted).
>         public static void endTimeLimitedActivity()
>
>        //called by resources (reader/writers) that need to be shared fairly
> by threads
>       public static void checkActivityNotElapsed(); //throws some form of
> runtime exception
> }
>
> A downside of breaking it out into static methods like this is that a
> thread cannot run >1 time-limited activity simultaneously but I guess that
> might be a reasonable restriction.
>
>
> >>Aside, how about using a PQ for the threads' times, or a TreeMap. That
> will save looping over the collection to find the next candidate. Just an
> implementation detail though.
>
> Yep, that was one of the rough edges - I just wanted to get raw timings
> first for the all the "is timed out?" checks we're injecting into reader
> calls.
>
> Cheers
> Mark
>
>
> On 27 Jun 2009, at 11:37, Shai Erera wrote:
>
> I like the overall approach. However it's very local to an IndexReader.
> I.e., if someone wanted to limit other operations (say indexing), or does
> not use an IndexReader (for a Scorer impl maybe), one cannot reuse it.
>
> What if we factor out the timeout logic to a Timeout class (I think it can
> be a static class, with the way you implemented it) and use it in
> TimeLimitingIndexReader? That class can offer a method check() which will do
> the internal logic (the 'if' check and throw exception). It will be similar
> to the current ensureOpen() followed by an operation.
>
> It might be considered more expensive since it won't check a boolean, but
> instead call a check() method, but it will be more reusable. Also,
> ensureOpen today is also a method call, so I don't think Timeout.check() is
> that bad. We can even later create a TimeLimitingIndexWriter and document
> Timeout class for other usage by external code.
>
> Aside, how about using a PQ for the threads' times, or a TreeMap. That will
> save looping over the collection to find the next candidate. Just an
> implementation detail though.
>
> Shai
>
> On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood <ma...@yahoo.co.uk>wrote:
>
>> Going back to my post re TimeLimitedIndexReaders - here's an incomplete
>> but functional prototype:
>> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
>> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>>
>>
>> The principle is that all reader accesses check a volatile variable
>> indicating something may have timed out (no need to check thread locals
>> etc.) If and only if a time out has been noted threadlocals are checked to
>> see which thread should throw a timeout exception.
>>
>> All time-limited use of reader must be wrapped in try...finally calls to
>> indicate the start and stop of a timed set of activities. A background
>> thread maintains the next anticipated timeout deadline and simply waits
>> until this is reached or the list of planned activities changes with new
>> deadlines.
>>
>>
>> Performance seems reasonable on my Wikipedia index:
>>
>> //some tests for heavy use of termenum/term docs
>> Read term docs for 200000 terms  in 4755 ms using no timeout limit (warm
>> up)
>> Read term docs for 200000 terms  in 4320 ms using no timeout limit (warm
>> up)
>> Read term docs for 200000 terms  in 4320 ms using no timeout limit
>> Read term docs for 200000 terms  in 4388 ms using  reader with
>> time-limited access
>>
>> //Example query with heavy use of termEnum/termDocs
>> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
>> +text:f* +text:a* +text:b* time limited collector matched 1090041 docs in
>> 1963 ms
>> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs in
>> 2121 ms
>>
>> //Example fuzzy match burning CPU reading TermEnum
>> text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
>> text:accomodation~0.5 time limited collector matched 192084 docs in 5923
>> ms
>> text:accomodation~0.5 time limited reader matched 192084 docs in 5945 ms
>>
>>
>> The reader approach to limiting time is slower but has these advantages :
>>
>> 1) Multiple reader activities can be time-limited rather than just single
>> searches
>> 2) No code changes required to scorers/queries/filters etc
>> 3) Tasks that spend plenty of  time burning CPU before collection happens
>> can be killed earlier
>>
>> I'm sure there's some thread safety issues to work through in my code and
>> not all reader classes are wrapped (e.g. TermPositions) but the basics are
>> there and seem to be functioning
>>
>> Thoughts?
>>
>
>
>

Re: Improving TimeLimitedCollector

Posted by Mark Harwood <ma...@yahoo.co.uk>.

Thanks for the feedback, Shai.

So I guess you're suggesting breaking this out into a general utility  
class e.g. something like:

class TimeLimitedThreadActivity
{
         //called by client
         public static void startTimeLimitedActivity(long  
maxTimePermitted).
         public static void endTimeLimitedActivity()

        //called by resources (reader/writers) that need to be shared  
fairly by threads
       public static void checkActivityNotElapsed(); //throws some  
form of runtime exception
}

A downside of breaking it out into static methods like this is that a  
thread cannot run >1 time-limited activity simultaneously but I guess  
that might be a reasonable restriction.


 >>Aside, how about using a PQ for the threads' times, or a TreeMap.  
That will save looping over the collection to find the next candidate.  
Just an implementation detail though.

Yep, that was one of the rough edges - I just wanted to get raw  
timings first for the all the "is timed out?" checks we're injecting  
into reader calls.

Cheers
Mark


On 27 Jun 2009, at 11:37, Shai Erera wrote:

> I like the overall approach. However it's very local to an  
> IndexReader. I.e., if someone wanted to limit other operations (say  
> indexing), or does not use an IndexReader (for a Scorer impl maybe),  
> one cannot reuse it.
>
> What if we factor out the timeout logic to a Timeout class (I think  
> it can be a static class, with the way you implemented it) and use  
> it in TimeLimitingIndexReader? That class can offer a method check()  
> which will do the internal logic (the 'if' check and throw  
> exception). It will be similar to the current ensureOpen() followed  
> by an operation.
>
> It might be considered more expensive since it won't check a  
> boolean, but instead call a check() method, but it will be more  
> reusable. Also, ensureOpen today is also a method call, so I don't  
> think Timeout.check() is that bad. We can even later create a  
> TimeLimitingIndexWriter and document Timeout class for other usage  
> by external code.
>
> Aside, how about using a PQ for the threads' times, or a TreeMap.  
> That will save looping over the collection to find the next  
> candidate. Just an implementation detail though.
>
> Shai
>
> On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood  
> <ma...@yahoo.co.uk> wrote:
> Going back to my post re TimeLimitedIndexReaders - here's an  
> incomplete but functional prototype:
>
> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>
>
> The principle is that all reader accesses check a volatile variable  
> indicating something may have timed out (no need to check thread  
> locals etc.) If and only if a time out has been noted threadlocals  
> are checked to see which thread should throw a timeout exception.
>
> All time-limited use of reader must be wrapped in try...finally  
> calls to indicate the start and stop of a timed set of activities. A  
> background thread maintains the next anticipated timeout deadline  
> and simply waits until this is reached or the list of planned  
> activities changes with new deadlines.
>
>
> Performance seems reasonable on my Wikipedia index:
>
> //some tests for heavy use of termenum/term docs
> Read term docs for 200000 terms  in 4755 ms using no timeout limit  
> (warm up)
> Read term docs for 200000 terms  in 4320 ms using no timeout limit  
> (warm up)
> Read term docs for 200000 terms  in 4320 ms using no timeout limit
> Read term docs for 200000 terms  in 4388 ms using  reader with time- 
> limited access
>
> //Example query with heavy use of termEnum/termDocs
> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in  
> 2000 ms
> +text:f* +text:a* +text:b* time limited collector matched 1090041  
> docs in 1963 ms
> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs  
> in 2121 ms
>
> //Example fuzzy match burning CPU reading TermEnum
> text:accomodation~0.5 no time limit matched 192084 docs in 	6428 ms
> text:accomodation~0.5 time limited collector matched 192084 docs in 	 
> 5923 ms
> text:accomodation~0.5 time limited reader matched 192084 docs in 	 
> 5945 ms
>
>
> The reader approach to limiting time is slower but has these  
> advantages :
>
> 1) Multiple reader activities can be time-limited rather than just  
> single searches
> 2) No code changes required to scorers/queries/filters etc
> 3) Tasks that spend plenty of  time burning CPU before collection  
> happens can be killed earlier
>
> I'm sure there's some thread safety issues to work through in my  
> code and not all reader classes are wrapped (e.g. TermPositions) but  
> the basics are there and seem to be functioning
>
> Thoughts?
>

Re: Improving TimeLimitedCollector

Posted by Shai Erera <se...@gmail.com>.

I like the overall approach. However it's very local to an IndexReader.
I.e., if someone wanted to limit other operations (say indexing), or does
not use an IndexReader (for a Scorer impl maybe), one cannot reuse it.

What if we factor out the timeout logic to a Timeout class (I think it can
be a static class, with the way you implemented it) and use it in
TimeLimitingIndexReader? That class can offer a method check() which will do
the internal logic (the 'if' check and throw exception). It will be similar
to the current ensureOpen() followed by an operation.

It might be considered more expensive since it won't check a boolean, but
instead call a check() method, but it will be more reusable. Also,
ensureOpen today is also a method call, so I don't think Timeout.check() is
that bad. We can even later create a TimeLimitingIndexWriter and document
Timeout class for other usage by external code.

Aside, how about using a PQ for the threads' times, or a TreeMap. That will
save looping over the collection to find the next candidate. Just an
implementation detail though.

Shai

On Sat, Jun 27, 2009 at 3:31 AM, Mark Harwood <ma...@yahoo.co.uk>wrote:

> Going back to my post re TimeLimitedIndexReaders - here's an incomplete but
> functional prototype:
> http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
> http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java
>
>
> The principle is that all reader accesses check a volatile variable
> indicating something may have timed out (no need to check thread locals
> etc.) If and only if a time out has been noted threadlocals are checked to
> see which thread should throw a timeout exception.
>
> All time-limited use of reader must be wrapped in try...finally calls to
> indicate the start and stop of a timed set of activities. A background
> thread maintains the next anticipated timeout deadline and simply waits
> until this is reached or the list of planned activities changes with new
> deadlines.
>
>
> Performance seems reasonable on my Wikipedia index:
>
> //some tests for heavy use of termenum/term docs
> Read term docs for 200000 terms  in 4755 ms using no timeout limit (warm
> up)
> Read term docs for 200000 terms  in 4320 ms using no timeout limit (warm
> up)
> Read term docs for 200000 terms  in 4320 ms using no timeout limit
> Read term docs for 200000 terms  in 4388 ms using  reader with time-limited
> access
>
> //Example query with heavy use of termEnum/termDocs
> +text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
> +text:f* +text:a* +text:b* time limited collector matched 1090041 docs in
> 1963 ms
> +text:f* +text:a* +text:b* time limited reader matched 1090041 docs in 2121
> ms
>
> //Example fuzzy match burning CPU reading TermEnum
> text:accomodation~0.5 no time limit matched 192084 docs in 6428 ms
> text:accomodation~0.5 time limited collector matched 192084 docs in 5923
> ms
> text:accomodation~0.5 time limited reader matched 192084 docs in 5945 ms
>
>
> The reader approach to limiting time is slower but has these advantages :
>
> 1) Multiple reader activities can be time-limited rather than just single
> searches
> 2) No code changes required to scorers/queries/filters etc
> 3) Tasks that spend plenty of  time burning CPU before collection happens
> can be killed earlier
>
> I'm sure there's some thread safety issues to work through in my code and
> not all reader classes are wrapped (e.g. TermPositions) but the basics are
> there and seem to be functioning
>
> Thoughts?
>

Re: Improving TimeLimitedCollector

Posted by Mark Harwood <ma...@yahoo.co.uk>.

Going back to my post re TimeLimitedIndexReaders - here's an  
incomplete but functional prototype:

http://www.inperspective.com/lucene/TimeLimitedIndexReader.java
http://www.inperspective.com/lucene/TestTimeLimitedIndexReader.java


The principle is that all reader accesses check a volatile variable  
indicating something may have timed out (no need to check thread  
locals etc.) If and only if a time out has been noted threadlocals are  
checked to see which thread should throw a timeout exception.

All time-limited use of reader must be wrapped in try...finally calls  
to indicate the start and stop of a timed set of activities. A  
background thread maintains the next anticipated timeout deadline and  
simply waits until this is reached or the list of planned activities  
changes with new deadlines.


Performance seems reasonable on my Wikipedia index:

//some tests for heavy use of termenum/term docs
Read term docs for 200000 terms  in 4755 ms using no timeout limit  
(warm up)
Read term docs for 200000 terms  in 4320 ms using no timeout limit  
(warm up)
Read term docs for 200000 terms  in 4320 ms using no timeout limit
Read term docs for 200000 terms  in 4388 ms using  reader with time- 
limited access

//Example query with heavy use of termEnum/termDocs
+text:f* +text:a* +text:b* no time limit matched 1090041 docs in 2000 ms
+text:f* +text:a* +text:b* time limited collector matched 1090041 docs  
in 1963 ms
+text:f* +text:a* +text:b* time limited reader matched 1090041 docs in  
2121 ms

//Example fuzzy match burning CPU reading TermEnum
text:accomodation~0.5 no time limit matched 192084 docs in 	6428 ms
text:accomodation~0.5 time limited collector matched 192084 docs in 	 
5923 ms
text:accomodation~0.5 time limited reader matched 192084 docs in 	5945  
ms


The reader approach to limiting time is slower but has these  
advantages :

1) Multiple reader activities can be time-limited rather than just  
single searches
2) No code changes required to scorers/queries/filters etc
3) Tasks that spend plenty of  time burning CPU before collection  
happens can be killed earlier

I'm sure there's some thread safety issues to work through in my code  
and not all reader classes are wrapped (e.g. TermPositions) but the  
basics are there and seem to be functioning

Thoughts?

Re: Improving TimeLimitedCollector

Posted by Shai Erera <se...@gmail.com>.

When this thread was focused on Scorers only, I wanted to offer the
following:

* Have a TimeLimitingQuery which will wrap another Query, and delegate all
the calls to it, except queryWeight, which will create TimeLimitingWeight.
* Have a TimeLimitingWeight which will wrap the original query's
QueryWeight, and impl its scorer(...) as: (1) create a Timeout object and
(2) call origQueryWeight.scorer(..., timeout).
* Add to QueryWeight.scorer() another parameter Timeout.

IndexSearcher will pass a null Timeout (or a NopTimeout) and if
TimeLimitingQuery is used by the user, it will create a true Timeout object.

Then, a Scorer impl can choose between ignoring the Timeout checks (for
which case we'll still have TimeLimitingCollector as a fallback), or call
periodically (every nextDoc(), advance()), timeout.check(), which will throw
TimeExceededException if the timeout has reached.

I then thought that if timeout is not required by an application, it will
pay the extrac checks for timeout != null, or in case of NopTimeout -
calling its empty check(), unnecessarily. Today, w/ TimeLimitingCollector,
if I don't instantiate it, I don't suffer from any extra overhead.

Mark, TimeLimitingIndexReader will basically have the same problem, only for
a wider range of operations. I.e., if I'm not interested in timeouts, any
operation that will be invoked will check for timeout. Are we ok with that
(e.g. adding null check or NopTimeout checks)?

If we are, then how about introducing a static Timeout class, with a
ThreadLocal member. Timeout.check() will invoke the ThreadLocal Timeout
member and check for timeout. The user will do
Timeout.start(MAX_ALLOWED_DURATION), which will set the ThreadLocal Timeout
and then any checks done by this thread will be timeout-able? At the end of
the operation, the user will need to do Timeout.cancel() or something like
that.

With this approach, we can discard TimeLimitingIndexReader, which we won't
need to create versions for Multi, Parallel etc., and use it wherever we
want. Or ... we can keep TimeLimitingIndexReader, and have it accept another
IndexReader, and override all the reasonable methods from IndexReader, doing
something like:
Timeout.start(MAX_ALLOWED_DURATION)
try {
  wrappedIndexReader.<method>();
} finally {
  Timeout.cancel();
}

That will be more convenient for the application I think.

Shai

On Thu, Jun 25, 2009 at 1:35 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> This would be good however how would we obtain the thread? I
> believe this would require using a ThreadLocalish type of system
> which could be quite slow (to obtain the thread and lookup in
> the hashmap).
>
> One implementation I looked at before was to add this check in
> IndexReader.isDeleted (by overriding inside FilterIndexReader).
>
> Perhaps SegmentTermDocs is the best place for this, it could
> check a timeout boolean variable which can (somehow) be
> optional, meaning, the code doesn't always check this and may
> also use a different call path. The boolean variable wouldn't be
> volatile as I believe based on our previous discussions is
> slower to check than a regular variable. Hopefully this would be
> good enough and more accurate than our current implementation.
>
>
> On Wed, Jun 24, 2009 at 3:07 PM, Mark Harwood <ma...@yahoo.co.uk>wrote:
>
>>  I think the Collector approach makes the most sense to me, since it's the
>>> only object I fully control in the search process. I cannot control Query
>>> implementations, and I cannot control the decisions made by IndexSearcher.
>>> But I can always wrap someone else's Collector with TLC and pass it to
>>> search().
>>>
>>
>>
>> Isn't another option to have a TimeLimitedIndexReader?
>>
>> Clients would need to make a call before the start of any timeout-able
>> activity e.g.
>>
>>    timeLimitedReader.startActivity(MAX_ALLOWED_DURATION);  //sets a
>> ThreadLocal
>>   //run query etc
>>
>> All invocations on special IndexReader, TermEnum, TermDocs etc subclasses
>> can then periodically check to see if the calling thread has exceeded its
>> allotted time. This would be a potentially non-invasive way of making all
>> filters', queries', scorers' etc read activity subject to some time limit as
>> they all typically have to invoke IndexReader-related classes relatively
>> frequently as part of their processing.
>>
>> Not sure how quick you can make all the underlying time checks though.
>>
>> Cheers,
>> Mark
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
>

Re: Improving TimeLimitedCollector

Posted by Jason Rutherglen <ja...@gmail.com>.

This would be good however how would we obtain the thread? I
believe this would require using a ThreadLocalish type of system
which could be quite slow (to obtain the thread and lookup in
the hashmap).

One implementation I looked at before was to add this check in
IndexReader.isDeleted (by overriding inside FilterIndexReader).

Perhaps SegmentTermDocs is the best place for this, it could
check a timeout boolean variable which can (somehow) be
optional, meaning, the code doesn't always check this and may
also use a different call path. The boolean variable wouldn't be
volatile as I believe based on our previous discussions is
slower to check than a regular variable. Hopefully this would be
good enough and more accurate than our current implementation.

On Wed, Jun 24, 2009 at 3:07 PM, Mark Harwood <ma...@yahoo.co.uk>wrote:

> I think the Collector approach makes the most sense to me, since it's the
>> only object I fully control in the search process. I cannot control Query
>> implementations, and I cannot control the decisions made by IndexSearcher.
>> But I can always wrap someone else's Collector with TLC and pass it to
>> search().
>>
>
>
> Isn't another option to have a TimeLimitedIndexReader?
>
> Clients would need to make a call before the start of any timeout-able
> activity e.g.
>
>    timeLimitedReader.startActivity(MAX_ALLOWED_DURATION);  //sets a
> ThreadLocal
>   //run query etc
>
> All invocations on special IndexReader, TermEnum, TermDocs etc subclasses
> can then periodically check to see if the calling thread has exceeded its
> allotted time. This would be a potentially non-invasive way of making all
> filters', queries', scorers' etc read activity subject to some time limit as
> they all typically have to invoke IndexReader-related classes relatively
> frequently as part of their processing.
>
> Not sure how quick you can make all the underlying time checks though.
>
> Cheers,
> Mark
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Improving TimeLimitedCollector

Posted by Mark Harwood <ma...@yahoo.co.uk>.

> I think the Collector approach makes the most sense to me, since  
> it's the only object I fully control in the search process. I cannot  
> control Query implementations, and I cannot control the decisions  
> made by IndexSearcher. But I can always wrap someone else's  
> Collector with TLC and pass it to search().


Isn't another option to have a TimeLimitedIndexReader?

Clients would need to make a call before the start of any timeout-able  
activity e.g.

     timeLimitedReader.startActivity(MAX_ALLOWED_DURATION);  //sets a  
ThreadLocal
    //run query etc

All invocations on special IndexReader, TermEnum, TermDocs etc  
subclasses can then periodically check to see if the calling thread  
has exceeded its allotted time. This would be a potentially non- 
invasive way of making all filters', queries', scorers' etc read  
activity subject to some time limit as they all typically have to  
invoke IndexReader-related classes relatively frequently as part of  
their processing.

Not sure how quick you can make all the underlying time checks though.

Cheers,
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Improving TimeLimitedCollector

Posted by Jason Rutherglen <ja...@gmail.com>.

Good point!  Queries that timeout can be lengthy nested boolean queries so
perhaps there's some other approach besides checking for each hit and
minimize CPU usage?

On Wed, Jun 24, 2009 at 2:25 AM, Earwin Burrfoot <ea...@gmail.com> wrote:

> Having scorers check timeouts while advancing will definetly increase
> the frequency of said timeouts.
>
> On Wed, Jun 24, 2009 at 13:13, eks dev<ek...@yahoo.co.uk> wrote:
> > Re: "I think such a parameter should not exist on individual search
> methods
> > since it's more of a global setting (i.e., I want my searches to be
> limited
> > to 5 seconds, always, not just for a particular query). Right?"
> >
> > I am not sure about this one, we had cases where one phisical index
> served
> > two logical indices with different requirements for clients. having
> Timeout
> > settable per Query is nice to have.
> >
> > At the end of day, with such timeout you support Quality/Time compromise
> > settings:
> > "if you need all results, be ready to wait longer and set longer timeout"
> > "if you need SOME results quickly than reduce this timeout"
> >
> > that should be idealy user decision
> >
> > ________________________________
> > From: Shai Erera <se...@gmail.com>
> > To: java-dev@lucene.apache.org
> > Sent: Wednesday, 24 June, 2009 10:55:50
> > Subject: Re: Improving TimeLimitedCollector
> >
> > But TimeLimitingCollector's logic is coded in its collect() method. The
> top
> > scorer calls nextDoc() or advance() on all its sub-scorers, and only when
> a
> > match is found it calls collect().
> >
> > If we want the sub-scorers to check whether they should abort, we'd need
> to
> > revamp (liked the word :)) TimeLimitingCollector, to be something like
> > CheckAbort SegmentMerger uses. I.e., the top scorer will pass such an
> > instance to its sub scorers, which will call a TimeLimit.check() or
> > something and if the time limit has expired this call will throw a
> > TimeExceededException (like TLC).
> >
> > We can enable this by adding another parameter to IndexSearcher whether
> > searches should be limited by time, and what's the time limit. It will
> then
> > instantiate that object and pass it to its Scorer and so on. I think such
> a
> > parameter should not exist on individual search methods since it's more
> of a
> > global setting (i.e., I want my searches to be limited to 5 seconds,
> always,
> > not just for a particular query). Right?
> >
> > Another option would be to add a setTimeout method on Query, which will
> use
> > it when it constructs its Scorer. The shortcoming of this is that if I
> want
> > to use someone else's query which did not implement setTimeout, then I'll
> > need to build a TimeOutQueryWrapper that will wrap a Query, and implement
> > the timeout logic, but that's get complicated.
> >
> > I think the Collector approach makes the most sense to me, since it's the
> > only object I fully control in the search process. I cannot control Query
> > implementations, and I cannot control the decisions made by
> IndexSearcher.
> > But I can always wrap someone else's Collector with TLC and pass it to
> > search().
> >
> > Shai
> >
> > On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen
> > <ja...@gmail.com> wrote:
> >>
> >> As we're revamping collectors, weights, and scorers, perhaps we
> >> can push time limiting into the individual subscorers? Currently
> >> on a boolean query, we're timing out the query at the top level
> >> which doesn't work well if the subqueries exceed the time limit.
> >
> >
> >
>
>
>
> --
> Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
> Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
> ICQ: 104465785
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>
>

Re: Improving TimeLimitedCollector

Posted by Earwin Burrfoot <ea...@gmail.com>.

Having scorers check timeouts while advancing will definetly increase
the frequency of said timeouts.

On Wed, Jun 24, 2009 at 13:13, eks dev<ek...@yahoo.co.uk> wrote:
> Re: "I think such a parameter should not exist on individual search methods
> since it's more of a global setting (i.e., I want my searches to be limited
> to 5 seconds, always, not just for a particular query). Right?"
>
> I am not sure about this one, we had cases where one phisical index served
> two logical indices with different requirements for clients. having Timeout
> settable per Query is nice to have.
>
> At the end of day, with such timeout you support Quality/Time compromise
> settings:
> "if you need all results, be ready to wait longer and set longer timeout"
> "if you need SOME results quickly than reduce this timeout"
>
> that should be idealy user decision
>
> ________________________________
> From: Shai Erera <se...@gmail.com>
> To: java-dev@lucene.apache.org
> Sent: Wednesday, 24 June, 2009 10:55:50
> Subject: Re: Improving TimeLimitedCollector
>
> But TimeLimitingCollector's logic is coded in its collect() method. The top
> scorer calls nextDoc() or advance() on all its sub-scorers, and only when a
> match is found it calls collect().
>
> If we want the sub-scorers to check whether they should abort, we'd need to
> revamp (liked the word :)) TimeLimitingCollector, to be something like
> CheckAbort SegmentMerger uses. I.e., the top scorer will pass such an
> instance to its sub scorers, which will call a TimeLimit.check() or
> something and if the time limit has expired this call will throw a
> TimeExceededException (like TLC).
>
> We can enable this by adding another parameter to IndexSearcher whether
> searches should be limited by time, and what's the time limit. It will then
> instantiate that object and pass it to its Scorer and so on. I think such a
> parameter should not exist on individual search methods since it's more of a
> global setting (i.e., I want my searches to be limited to 5 seconds, always,
> not just for a particular query). Right?
>
> Another option would be to add a setTimeout method on Query, which will use
> it when it constructs its Scorer. The shortcoming of this is that if I want
> to use someone else's query which did not implement setTimeout, then I'll
> need to build a TimeOutQueryWrapper that will wrap a Query, and implement
> the timeout logic, but that's get complicated.
>
> I think the Collector approach makes the most sense to me, since it's the
> only object I fully control in the search process. I cannot control Query
> implementations, and I cannot control the decisions made by IndexSearcher.
> But I can always wrap someone else's Collector with TLC and pass it to
> search().
>
> Shai
>
> On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen
> <ja...@gmail.com> wrote:
>>
>> As we're revamping collectors, weights, and scorers, perhaps we
>> can push time limiting into the individual subscorers? Currently
>> on a boolean query, we're timing out the query at the top level
>> which doesn't work well if the subqueries exceed the time limit.
>
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423
ICQ: 104465785

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Improving TimeLimitedCollector

Posted by eks dev <ek...@yahoo.co.uk>.

Re: "I think such a parameter should not exist on individual search methods
since it's more of a global setting (i.e., I want my searches to be
limited to 5 seconds, always, not just for a particular query). Right?"

I am not sure about this one, we had cases where one phisical index served two logical indices with different requirements for clients. having Timeout settable per Query is nice to have. 

At the end of day, with such timeout you support Quality/Time compromise settings:
"if you need all results, be ready to wait longer and set longer timeout"
"if you need SOME results quickly than reduce this timeout"

that should be idealy user decision 




________________________________
From: Shai Erera <se...@gmail.com>
To: java-dev@lucene.apache.org
Sent: Wednesday, 24 June, 2009 10:55:50
Subject: Re: Improving TimeLimitedCollector


But TimeLimitingCollector's logic is coded in its collect() method. The top scorer calls nextDoc() or advance() on all its sub-scorers, and only when a match is found it calls collect().

If we want the sub-scorers to check whether they should abort, we'd need to revamp (liked the word :)) TimeLimitingCollector, to be something like CheckAbort SegmentMerger uses. I.e., the top scorer will pass such an instance to its sub scorers, which will call a TimeLimit.check() or something and if the time limit has expired this call will throw a TimeExceededException (like TLC).

We can enable this by adding another parameter to IndexSearcher whether searches should be limited by time, and what's the time limit. It will then instantiate that object and pass it to its Scorer and so on. I think such a parameter should not exist on individual search methods since it's more of a global setting (i.e., I want my searches to be limited to 5 seconds, always, not just for a particular query). Right?

Another option would be to add a setTimeout method on Query, which will use it when it constructs its Scorer. The shortcoming of this is that if I want to use someone else's query which did not implement setTimeout, then I'll need to build a TimeOutQueryWrapper that will wrap a Query, and implement the timeout logic, but that's get complicated.

I think the Collector approach makes the most sense to me, since it's the only object I fully control in the search process. I cannot control Query implementations, and I cannot control the decisions made by IndexSearcher. But I can always wrap someone else's Collector with TLC and pass it to search().

Shai


On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen <ja...@gmail.com> wrote:

As we're revamping collectors, weights, and scorers, perhaps we
can push time limiting into the individual subscorers? Currently
on a boolean query, we're timing out the query at the top level
which doesn't work well if the subqueries exceed the time limit.

Re: Improving TimeLimitedCollector

Posted by Shai Erera <se...@gmail.com>.

But TimeLimitingCollector's logic is coded in its collect() method. The top
scorer calls nextDoc() or advance() on all its sub-scorers, and only when a
match is found it calls collect().

If we want the sub-scorers to check whether they should abort, we'd need to
revamp (liked the word :)) TimeLimitingCollector, to be something like
CheckAbort SegmentMerger uses. I.e., the top scorer will pass such an
instance to its sub scorers, which will call a TimeLimit.check() or
something and if the time limit has expired this call will throw a
TimeExceededException (like TLC).

We can enable this by adding another parameter to IndexSearcher whether
searches should be limited by time, and what's the time limit. It will then
instantiate that object and pass it to its Scorer and so on. I think such a
parameter should not exist on individual search methods since it's more of a
global setting (i.e., I want my searches to be limited to 5 seconds, always,
not just for a particular query). Right?

Another option would be to add a setTimeout method on Query, which will use
it when it constructs its Scorer. The shortcoming of this is that if I want
to use someone else's query which did not implement setTimeout, then I'll
need to build a TimeOutQueryWrapper that will wrap a Query, and implement
the timeout logic, but that's get complicated.

I think the Collector approach makes the most sense to me, since it's the
only object I fully control in the search process. I cannot control Query
implementations, and I cannot control the decisions made by IndexSearcher.
But I can always wrap someone else's Collector with TLC and pass it to
search().

Shai

On Wed, Jun 24, 2009 at 12:26 AM, Jason Rutherglen <
jason.rutherglen@gmail.com> wrote:

> As we're revamping collectors, weights, and scorers, perhaps we
> can push time limiting into the individual subscorers? Currently
> on a boolean query, we're timing out the query at the top level
> which doesn't work well if the subqueries exceed the time limit.
>