Posted to java-user@lucene.apache.org by Vitaly Funstein <vf...@gmail.com> on 2013/10/11 04:01:04 UTC
Cost of keeping around IndexReader instances
Hello,
I am trying to weigh some ideas for implementing paged search functionality
in our system, which has these basic requirements:
- Using Solr is not an option (at the moment).
- Any Lucene 4.x version can be used.
- Result pagination is driven by the user application code.
- User app can request a subset of results, without sequentially
iterating from the start, by specifying the start/end of a range. The subset
must correspond exactly to the part of the full set that would have been
returned at the specified offsets had the full set been requested to begin
with, i.e. for each query, the result set must be "stable".
- Result set must also be detached from live data, i.e. concurrent
mutations must not be reflected in the results, throughout the lifecycle of
the whole set.
At the moment, I have come up with two different approaches to solve this,
and would like some input.
In each case, the common part is to use a ReaderManager tied to the
IndexWriter on the index. For each new query received, call
ReaderManager.maybeRefresh(), followed by acquire(), but also do the
refresh in the background, on a timer - this is as recommended by the docs.
But here are the differences.
1. Initial idea
   - When a new query is executed, I cache the DirectoryReader instance
returned by acquire(), associating it with the query itself.
   - Use a simple custom Collector that slurps in all doc ids for
matches and keeps them in memory, in a plain array.
   - Subsequent requests for individual result "pages" for that query
use the cached reader, to meet the "snapshot" requirement, referencing
doc ids at the requested offsets, i.e. IndexReader.document(id)... or I
might use DocValues - that's still TBD; the key is that I reuse the
previously collected doc ids.
   - When the app is done with the results, it indicates so and I call
ReaderManager.release(); all collected ids are also cleared.
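Once the doc ids are collected into a plain array, the per-page lookup in this scheme is trivial, since the array never changes after collection and so every page request sees the same ordering. A sketch of that step (the class name and the clamping behavior at the end of the set are my own choices, not from the post):

```java
import java.util.Arrays;

// Serves pages over a result set whose matching doc ids were collected
// once by the custom Collector and then cached, in hit order. Because the
// array is immutable after collection, the result set is "stable": the
// same [start, end) range always yields the same docs.
class ResultPager {
    private final int[] docIds;

    ResultPager(int[] docIds) { this.docIds = docIds; }

    // Returns the doc ids for the half-open range [start, end),
    // clamped to the size of the result set.
    int[] page(int start, int end) {
        if (start < 0 || start > end) throw new IllegalArgumentException();
        int from = Math.min(start, docIds.length);
        int to = Math.min(end, docIds.length);
        return Arrays.copyOfRange(docIds, from, to);
    }

    int size() { return docIds.length; }
}
```

Each id returned from page() would then be resolved against the cached reader, e.g. via IndexReader.document(id) or DocValues as described above.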
2. Alternate method
   - On query execution, fully materialize result objects from the search
and persist them in binary form in a secondary index. These are basically
serialized POJOs, indexed by a unique combination of
requester/query/position ids.
   - Once generated, these results never change until deleted from the
secondary index due to app-driven cleanup.
   - Result block requests run against this index, and not the live data.
   - After materializing the result set, the original IndexReader (from the
primary index) is released.
   - Thus, IndexReader instances are only kept around during query
handling.
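The "unique combination of requester/query/position ids" in method 2 is essentially a composite key for the secondary index. A minimal sketch of such a key (field names and the string form are illustrative, not from the post):

```java
import java.util.Objects;

// Composite key for entries in the secondary results index: one serialized
// result object per (requester, query, position). Equality and hashing
// over all three fields make it usable as a map key, or its string form
// as a single unique keyword term in the index.
final class ResultKey {
    final String requesterId;
    final String queryId;
    final int position;

    ResultKey(String requesterId, String queryId, int position) {
        this.requesterId = requesterId;
        this.queryId = queryId;
        this.position = position;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof ResultKey)) return false;
        ResultKey k = (ResultKey) o;
        return position == k.position
            && requesterId.equals(k.requesterId)
            && queryId.equals(k.queryId);
    }

    @Override public int hashCode() {
        return Objects.hash(requesterId, queryId, position);
    }

    // Flattened form, e.g. for indexing as one keyword field.
    @Override public String toString() {
        return requesterId + "/" + queryId + "/" + position;
    }
}
```

App-driven cleanup then reduces to deleting every entry whose key shares a given requester/query prefix.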
So the questions I have here are:
- Is my assumption correct that once opened, a particular IndexReader
instance cannot see subsequent changes to the index it was opened on? If
so, does every open imply an inline commit on the writer?
- What is the cost of keeping readers around in method 1, preventing
them from closing - in terms of memory, file handles and locks?
Of course, in either approach, I plan on using a global result set limit to
prevent misuse, similar to how a database might set a limit on open result
cursors. But this limit would be dependent on the method chosen from above,
so any hints would be appreciated.
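Whichever method is chosen, the global cap on open result sets can be enforced the same way databases cap open cursors: a counting semaphore where each open result set holds one permit until the app signals it is done. A sketch under those assumptions (the class name and non-blocking rejection policy are mine):

```java
import java.util.concurrent.Semaphore;

// Caps how many result sets (and thus, in method 1, cached readers) may
// be open at once across the whole system.
class ResultSetLimiter {
    private final Semaphore permits;

    ResultSetLimiter(int maxOpen) { this.permits = new Semaphore(maxOpen); }

    // Returns false instead of blocking when the cap is reached, so the
    // caller can reject the query rather than queue it indefinitely.
    boolean tryOpen() { return permits.tryAcquire(); }

    // Called when the app indicates it is done with a result set,
    // alongside ReaderManager.release() in method 1.
    void close() { permits.release(); }

    int available() { return permits.availablePermits(); }
}
```

Rejecting at the cap (rather than blocking) keeps a misbehaving client from stalling everyone else's queries.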
Re: Cost of keeping around IndexReader instances
Posted by Vitaly Funstein <vf...@gmail.com>.
UPDATE: I went with method 1, i.e. keeping IndexReader instances open
between requests. Which brings me back to the original questions - is there
any way of quantifying the impact of not closing a particular IndexReader?
Does this depend on the number of segments per index, open file count, etc.?
On Thu, Oct 10, 2013 at 7:01 PM, Vitaly Funstein <vf...@gmail.com> wrote: