Posted to java-user@lucene.apache.org by Vitaly Funstein <vf...@gmail.com> on 2013/10/11 04:01:04 UTC

Cost of keeping around IndexReader instances

Hello,

I am trying to weigh some ideas for implementing paged search functionality
in our system, which has these basic requirements:

   - Using Solr is not an option (at the moment).
   - Any Lucene 4.x version can be used.
   - Result pagination is driven by the user application code.
   - User app can request a subset of results, without sequentially
   iterating from the start, by specifying the start/end of a range. The
   subset must correspond exactly to the part of the full set at the
   specified offsets, as if the full set had been requested to begin with,
   i.e. for each query, the result set must be "stable".
   - Result set must also be detached from live data, i.e. concurrent
   mutations must not be reflected in results, throughout the lifecycle of
   the whole set.

At the moment, I have come up with two different approaches to solve this,
and would like some input.

In each case, the common part is to use a ReaderManager tied to the
IndexWriter on the index. For each new query received, call
ReaderManager.maybeRefresh(), followed by acquire(), but also do the
refresh in the background, on a timer - this is as recommended by the docs.
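
A rough sketch of that common part, against the Lucene 4.x API - the class
name, refresh interval and applyAllDeletes choice below are mine, not
settled decisions:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.ReaderManager;

public class ReaderHolder {
  private final ReaderManager manager;
  private final ScheduledExecutorService refresher =
      Executors.newSingleThreadScheduledExecutor();

  public ReaderHolder(IndexWriter writer) throws IOException {
    // NRT manager tied to the writer; applyAllDeletes = true
    manager = new ReaderManager(writer, true);
    // background refresh on a timer, per the ReferenceManager javadocs
    refresher.scheduleWithFixedDelay(new Runnable() {
      public void run() {
        try {
          manager.maybeRefresh();
        } catch (IOException e) {
          // log and move on; the next tick will retry
        }
      }
    }, 5, 5, TimeUnit.SECONDS);
  }

  public DirectoryReader acquireForQuery() throws IOException {
    manager.maybeRefresh();   // also attempt a refresh on the query path
    return manager.acquire(); // caller must eventually release()
  }

  public void release(DirectoryReader reader) throws IOException {
    manager.release(reader);
  }
}

But here are the differences.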

   1. Initial idea (sketched in code after this list)
      - When a new query is executed, I cache the DirectoryReader instance
      returned by acquire(), associating it with the query itself.
      - Use a simple custom Collector that slurps in all doc ids for
      matches, and keeps them in memory, in a plain array.
      - Subsequent requests for individual result "pages" for that query
      use the cached reader, to meet the "snapshot" requirement,
      referencing doc ids at the requested offsets, i.e.
      IndexReader.document(id)... or I might use DocValues - that's still
      TBD; the key is that I reuse the previously collected doc ids.
      - When the app is done with the results, it indicates so, and I call
      ReaderManager.release(); all collected ids are also cleared.
   2. Alternate method (also sketched after the list)
      - On query execution, fully materialize result objects from the
      search and persist them in binary form in a secondary index. These
      are basically serialized POJOs, indexed by a unique combination of
      requester/query/position ids.
      - Once generated, these results never change until deleted from the
      secondary index by app-driven cleanup.
      - Result block requests run against this index, not the live data.
      - After materializing the result set, the original IndexReader (from
      the primary index) is released.
      - Thus, IndexReader instances are only kept around during query
      handling.
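
To make method 1 concrete, here is a rough sketch of the collector and
paging parts (class and method names are mine, for illustration only; the
collector would be passed to IndexSearcher.search(query, collector)):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Collects every matching doc id into memory, remapped to index-wide ids.
public class AllDocIdsCollector extends Collector {
  private final List<Integer> docIds = new ArrayList<Integer>();
  private int docBase;

  @Override
  public void setScorer(Scorer scorer) {
    // scores are not needed, only doc ids
  }

  @Override
  public void setNextReader(AtomicReaderContext context) {
    docBase = context.docBase; // segment-relative ids -> index-wide ids
  }

  @Override
  public void collect(int doc) {
    docIds.add(docBase + doc);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false; // collect in doc-id order so pages stay stable
  }

  public List<Integer> getDocIds() {
    return docIds;
  }

  // Serve one "page" [from, to) from the cached reader and collected ids.
  public static List<Document> page(IndexReader reader, List<Integer> ids,
      int from, int to) throws IOException {
    List<Document> result = new ArrayList<Document>();
    for (int i = from; i < Math.min(to, ids.size()); i++) {
      result.add(reader.document(ids.get(i)));
    }
    return result;
  }
}

And for method 2, the secondary-index write might look something like this
(the "key" and "payload" field names and the serialization scheme are made
up - whatever fits the POJOs would do):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;

// Persist one materialized result under requester/query/position.
void persistResult(IndexWriter secondaryWriter, String requesterId,
    String queryId, int position, byte[] serializedPojo) throws IOException {
  Document d = new Document();
  d.add(new StringField("key",
      requesterId + "/" + queryId + "/" + position, Field.Store.NO));
  d.add(new StoredField("payload", serializedPojo));
  secondaryWriter.addDocument(d);
}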

So the questions I have here are:

   - Is my assumption correct that once opened, a particular IndexReader
   instance cannot see subsequent changes to the index it was opened on? If
   so, does every open imply an inline commit on the writer?
   - What is the cost of keeping readers around in method 1, preventing
   them from closing - in terms of memory, file handles, and locks?

Of course, in either approach, I plan on using a global result set limit to
prevent misuse, similar to how a database might set a limit on open result
cursors. But this limit would depend on the method chosen above, so any
hints would be appreciated.
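
To make the limit idea concrete, it could be as simple as a counting
semaphore around result set creation - a minimal sketch, with an arbitrary
placeholder cap:

import java.util.concurrent.Semaphore;

// Global cap on concurrently open result sets, akin to a database's
// limit on open cursors.
public class ResultSetLimiter {
  private final Semaphore permits;

  public ResultSetLimiter(int maxOpenResultSets) {
    permits = new Semaphore(maxOpenResultSets);
  }

  // Call before ReaderManager.acquire() / result materialization.
  public void onOpen() {
    if (!permits.tryAcquire()) {
      throw new IllegalStateException("too many open result sets");
    }
  }

  // Call alongside ReaderManager.release() / secondary index cleanup.
  public void onClose() {
    permits.release();
  }
}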

Re: Cost of keeping around IndexReader instances

Posted by Vitaly Funstein <vf...@gmail.com>.
UPDATE: I went with method 1, i.e. keeping IndexReader instances open
between requests. Which brings me back to the original questions - is there
any way of quantifying the impact of not closing a particular IndexReader?
Does this depend on the number of segments per index, the open file count,
etc.?
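
The closest I have to a probe so far is just counting what an open reader
pins (a rough sketch, assuming the ReaderManager from my first post is in
scope as "manager"):

import org.apache.lucene.index.DirectoryReader;

// Each leaf is one segment; while the reader stays open, that segment's
// files (postings, stored fields, norms, etc.) cannot be deleted, and
// their file handles and heap-resident terms indexes stay alive.
DirectoryReader reader = manager.acquire();
try {
  System.out.println("segments pinned: " + reader.leaves().size());
} finally {
  manager.release(reader);
}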

