Posted to java-user@lucene.apache.org by Vladimir Kotal <vl...@oracle.com> on 2016/07/15 09:49:10 UTC

MmapDirectory and IndexReader reuse

Hi all,

when trying to identify bottlenecks in our application, I found that
each search which involves multiple indexes performs lots of
mmap()/open() syscalls. This is a natural consequence of using
MMapDirectory. So even if file system caches are properly warmed, this
might add a couple of seconds (depending on the operating system or
virtualization technology) to the request handling time, especially when
the number of searched indexes is in the hundreds (see
https://github.com/OpenGrok/OpenGrok/issues/1116 for the gory details).

I was wondering if we can amortize the syscall load by caching
IndexReader objects. The search (which is done in the webapp) looks like
this:

 
https://github.com/OpenGrok/OpenGrok/blob/master/src/org/opensolaris/opengrok/search/SearchEngine.java#L203

and the idea would be to reuse each IndexReader until the next refresh
of its underlying index. This would avoid the syscalls incurred each
time an index is opened through MMapDirectory.

My worry is what happens if the indexer runs and writes to the index
files while they are mmap'ed in memory - could this lead to corrupted
search results?

The reindex work is visible here:

 
https://github.com/OpenGrok/OpenGrok/blob/master/src/org/opensolaris/opengrok/index/IndexDatabase.java#L341

The documents are added or removed in the call to indexDown(), which is
basically a recursive traversal of the directory tree. The commit happens
only after the traversal is done.

The IndexWriter is set up with CREATE_OR_APPEND, which I am not sure is
desirable for the reuse. If we can avoid writing into existing index
files (or at least make sure they are only appended to) while
reindexing, this should make the reuse possible, I think.

Any comments are welcome,


v.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: MmapDirectory and IndexReader reuse

Posted by Vladimir Kotal <vl...@oracle.com>.
On 07/21/16 20:18, Michael McCandless wrote:
> Can't you pass your own SearcherFactory to SearcherManager to do that?

Aha! That should be possible, thanks.


v.



Re: MmapDirectory and IndexReader reuse

Posted by Michael McCandless <lu...@mikemccandless.com>.
Can't you pass your own SearcherFactory to SearcherManager to do that?
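
A minimal sketch of that suggestion, assuming a Lucene version in which
SearcherFactory.newSearcher takes (reader, previousReader); older releases
use a one-argument form. The class name ExecutorSearcherFactory is made up
for illustration and is not part of Lucene or OpenGrok.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;

/** Hypothetical factory: every IndexSearcher it creates runs on the given pool. */
class ExecutorSearcherFactory extends SearcherFactory {
    private final ExecutorService executor;

    ExecutorSearcherFactory(ExecutorService executor) {
        this.executor = executor;
    }

    @Override
    public IndexSearcher newSearcher(IndexReader reader, IndexReader previousReader)
            throws IOException {
        // previousReader is null on the first open; this sketch does not need it.
        return new IndexSearcher(reader, executor);
    }
}
```

It would be wired in with something like
new SearcherManager(directory, new ExecutorSearcherFactory(pool)), where
directory and pool are whatever the application already uses.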

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 21, 2016 at 1:34 PM, Vladimir Kotal <vl...@oracle.com>
wrote:

> On 07/18/16 05:52 PM, Uwe Schindler wrote:
>
>> Hi,
>>
>> Have a separate searcher manager for every directory. On every incoming
>> search request, fetch the actual DirectoryReaders from the searcher
>> managers and build a MultiReader from it. This costs nothing, as
>> MultiReader is just a thin wrapper where no caching is involved. On top of
>> this MultiReader create an IndexSearcher (which is also cheap).
>>
>
> SearcherManager's acquire() returns IndexSearcher. The trouble is that we
> need to have the IndexSearcher constructed with a certain ExecutorService,
> and there does not seem to be a way to do that with SearcherManager.
>
> I wonder if SearcherManager can be extended to allow this. Otherwise we
> would have to use ReferenceManager and reimplement the functionality found
> in SearcherManager which I'd like to avoid.
>
>
>
> v.

RE: MmapDirectory and IndexReader reuse

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

a quick alternative is to get the underlying reader from the IndexSearcher. You don't use the IndexSearcher provided by SearcherManager for searching; you only use it to access the underlying Index-/DirectoryReader, which you can feed into new MultiReaders. The good things about that are:

- Every index has its own SearcherManager and can therefore update itself separately.
- The search logic just fetches the IndexSearcher from every SearcherManager it is interested in, but instead of using the IndexSearchers directly, it ignores them and fetches the underlying IndexReader via IndexSearcher#getIndexReader.

There is no overhead involved, as IndexSearcher is just a "cheap wrapper" around IndexReader. If you don't use it, you don't need to care about it.
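
The approach described above could look roughly like the following sketch. The class and method names (MultiIndexSearch, search) are invented for illustration, and the list of SearcherManagers (one per index) is assumed to be maintained elsewhere by the application.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

/** Hypothetical helper: one search across many independently refreshed indexes. */
final class MultiIndexSearch {
    private MultiIndexSearch() {}

    static TopDocs search(List<SearcherManager> managers, ExecutorService executor,
                          Query query, int topN) throws IOException {
        List<IndexSearcher> acquired = new ArrayList<>(managers.size());
        try {
            IndexReader[] readers = new IndexReader[managers.size()];
            for (int i = 0; i < readers.length; i++) {
                // The per-index searcher is only a handle to its reader.
                IndexSearcher perIndex = managers.get(i).acquire();
                acquired.add(perIndex);
                readers[i] = perIndex.getIndexReader();
            }
            // closeSubReaders=false: closing the MultiReader only drops the
            // references it took itself and leaves the managed readers open.
            try (MultiReader multi = new MultiReader(readers, false)) {
                return new IndexSearcher(multi, executor).search(query, topN);
            }
        } finally {
            // Give every acquired searcher back to the manager it came from.
            for (int i = 0; i < acquired.size(); i++) {
                managers.get(i).release(acquired.get(i));
            }
        }
    }
}
```

Note that the doc IDs in the returned TopDocs are only meaningful against the MultiReader they came from, so a real implementation would probably load stored fields before releasing the searchers.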

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: MmapDirectory and IndexReader reuse

Posted by Vladimir Kotal <vl...@oracle.com>.
On 07/18/16 05:52 PM, Uwe Schindler wrote:
> Hi,
>
> Have a separate searcher manager for every directory. On every incoming search request, fetch the actual DirectoryReaders from the searcher managers and build a MultiReader from it. This costs nothing, as MultiReader is just a thin wrapper where no caching is involved. On top of this MultiReader create an IndexSearcher (which is also cheap).

SearcherManager's acquire() returns IndexSearcher. The trouble is that
we need to have the IndexSearcher constructed with a certain
ExecutorService, and there does not seem to be a way to do that with
SearcherManager.

I wonder if SearcherManager can be extended to allow this. Otherwise we 
would have to use ReferenceManager and reimplement the functionality 
found in SearcherManager which I'd like to avoid.


v.



RE: MmapDirectory and IndexReader reuse

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

Have a separate SearcherManager for every directory. On every incoming search request, fetch the current DirectoryReaders from the SearcherManagers and build a MultiReader from them. This costs nothing, as MultiReader is just a thin wrapper where no caching is involved. On top of this MultiReader, create an IndexSearcher (which is also cheap).

The actual search and caching are always executed on the segments collected from all indexes; the cost of wrapping with MultiReaders and IndexSearchers is negligible.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de



Re: MmapDirectory and IndexReader reuse

Posted by Vladimir Kotal <vl...@oracle.com>.
On 07/15/16 12:00 PM, Uwe Schindler wrote:
> Hi,
>
> You should keep the IndexReader open for the whole time! Otherwise there are more bottlenecks and slowdowns.
>
> If you are updating the index, you should use SearcherManager, which reopens the index reader as needed. After updating the index you should also not completely close and reopen the index. SearcherManager uses the DirectoryReader.openIfChanged() method, which just updates the "view" currently seen and involves minimal syscalls (none at all if nothing changed).

Is it somehow possible to use SearcherManager with MultiReader to search
multiple indexes? We need the IndexSearcher to be constructed from
multiple IndexReaders and an executor with its own thread pool.
Looking at the documentation, there does not seem to be a way to do that,
as the SearcherManager constructor works only on a single directory.


v.



RE: MmapDirectory and IndexReader reuse

Posted by Uwe Schindler <uw...@thetaphi.de>.
Hi,

You should keep the IndexReader open the whole time! Otherwise there are more bottlenecks and slowdowns.

If you are updating the index, you should use SearcherManager, which reopens the index reader as needed. After updating the index you should also not completely close and reopen the index. SearcherManager uses the DirectoryReader.openIfChanged() method, which just updates the "view" currently seen and involves minimal syscalls (none at all if nothing changed).

> My worry is what happens if indexer runs and writes to the index files
> while they are mmap'ed in memory - could this lead to corrupted search ?

No, because Lucene never changes existing files. All changes are written to new files, which become visible after flushing/committing or reopening as described above. In addition, those immutable segments are merged in the background while indexing, but all files currently referenced by IndexReaders/IndexSearchers remain immutable and stay alive until the IndexReader is closed.
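
To make the write-once idea concrete, here is a toy illustration in plain Java NIO. It is not Lucene code, and the class name and the file names such as "_0.seg" are invented. The writer only ever creates new files, so a view that a reader mapped earlier keeps showing the same bytes.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy model of write-once segment files; an assumption-laden sketch, not Lucene. */
final class WriteOnceMmapDemo {
    private WriteOnceMmapDemo() {}

    /** Writer side: CREATE_NEW fails if the file exists, so nothing is overwritten. */
    static void writeSegment(Path dir, String name, String content) throws IOException {
        Files.write(dir.resolve(name), content.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
    }

    /** Reader side: maps an existing file read-only; the mapping outlives the channel. */
    static MappedByteBuffer mapSegment(Path dir, String name) throws IOException {
        try (FileChannel ch = FileChannel.open(dir.resolve(name), StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    /** Copies the mapped bytes out without moving the buffer's own position. */
    static String read(MappedByteBuffer view) {
        byte[] bytes = new byte[view.remaining()];
        view.duplicate().get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this sketch, mapping "_0.seg", then writing "_1.seg", leaves the mapped view of "_0.seg" untouched, and an attempt to write "_0.seg" a second time is rejected with a FileAlreadyExistsException.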

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de
