Posted to oak-dev@jackrabbit.apache.org by Chetan Mehrotra <ch...@gmail.com> on 2014/04/08 17:51:55 UTC

Slow full text query performance and Lucene Index handling in Oak

Hi,

As part of OAK-1702 [1] I have added a benchmark to compare the
performance of full text queries in Oak with JR2.

Based on the approach taken (which might be wrong) I get the following numbers:

Apache Jackrabbit Oak 0.21.0-SNAPSHOT
# FullTextSearchTest               C     min     10%     50%     90%     max       N
Oak-Mongo                          1      58      71     101     119     287     610
Oak-Mongo-FDS                      1      50      51      52      58     184    1106
Oak-Tar                            1      39      40      40      44      64    1459
Oak-Tar-FDS                        1      53      54      55      64     197    1030
Jackrabbit                         1       4       4       5       6     231   11385

This shows that JR2 performs a lot better for full text queries, and
subsequent queries are considerably faster once Lucene has warmed up.

Looking at the current usage of Lucene in Oak and the way we store and
access the Lucene indexes [2], I have a couple of doubts:

1. Multiple IndexSearcher instances - The current impl creates a new
IndexSearcher for every Lucene query, as the OakDirectory it uses is bound
to the NodeState of the executing JCR session. Compared to this, in JR2 we
probably had a singleton IndexSearcher which was shared across all
query execution paths. This could cause a performance issue,
as Lucene is effectively used in a stateless way and has to
perform initialization for every call. As per [3], the IndexSearcher must
be shared.

2. Index Access - Currently we have a custom OakDirectory which provides
access to Lucene indexes stored in the NodeStore. Even with the SegmentStore,
which uses memory mapped files, the random access used by Lucene would
probably be a lot slower with OakDirectory than with the default
Lucene MMapDirectory. For small setups where the Lucene index can be
accommodated on each node, I think it would be better if the index is
accessed from the file system.

Are the above concerns valid, and should we take another look at how we are
using Lucene in Oak?

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/OAK-1702
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/OakDirectory.java
[3] http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
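
The first concern above can be illustrated with a small self-contained
sketch. It does not use Lucene or Oak: FakeSearcher, SharedSearcher and
compare are hypothetical names, and the counter in the constructor only
stands in for whatever initialization a real IndexSearcher and its reader
perform. It shows how often the "open" step runs in each style, nothing
more.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model that counts how often an expensive "open" step runs. Not Oak or Lucene code. */
class SearcherSharingSketch {

    /** Hypothetical stand-in for an index searcher; constructing it models the costly open. */
    static final class FakeSearcher {
        static final AtomicInteger OPEN_COUNT = new AtomicInteger();

        FakeSearcher() {
            OPEN_COUNT.incrementAndGet(); // models per-open initialization work
        }

        int search(String term) {
            return term.length(); // dummy result, the value is irrelevant here
        }
    }

    /** Per-query style: a new searcher is created for every call. */
    static int queryWithFreshSearcher(String term) {
        return new FakeSearcher().search(term);
    }

    /** Shared style: one lazily created searcher is reused by all callers. */
    static final class SharedSearcher {
        private volatile FakeSearcher searcher;

        FakeSearcher get() {
            FakeSearcher s = searcher;
            if (s == null) {
                synchronized (this) {
                    if (searcher == null) {
                        searcher = new FakeSearcher();
                    }
                    s = searcher;
                }
            }
            return s;
        }
    }

    /** Runs the given number of queries in both styles; returns {perQueryOpens, sharedOpens}. */
    static int[] compare(int queries) {
        FakeSearcher.OPEN_COUNT.set(0);
        for (int i = 0; i < queries; i++) {
            queryWithFreshSearcher("text");
        }
        int perQueryOpens = FakeSearcher.OPEN_COUNT.get();

        FakeSearcher.OPEN_COUNT.set(0);
        SharedSearcher shared = new SharedSearcher();
        for (int i = 0; i < queries; i++) {
            shared.get().search("text");
        }
        int sharedOpens = FakeSearcher.OPEN_COUNT.get();
        return new int[] {perQueryOpens, sharedOpens};
    }

    public static void main(String[] args) {
        int[] opens = compare(100);
        System.out.println("per-query opens: " + opens[0] + ", shared opens: " + opens[1]);
    }
}
```

Whether the real cost of opening dominates query time is exactly what the
benchmark has to show; the sketch only makes the difference in the number
of opens explicit.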

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

Do we still have the option to store the Lucene files in the file system?
If we have, maybe we could run the test with that option and see if it
improves performance? I'm not suggesting this is a solution, it's just one
step to better analyze things. And it might be easy to do.

Regards,
Thomas




Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

We have results from a different test case with multiple threads (internal
id GRANITE-5572). We have 50 full thread dumps, and there I count:

* 259 cases of LuceneIndex.java line 365:
  IndexReader reader = DirectoryReader.open(directory);

* 43 cases of LuceneIndex.java line 379:
  TopDocs docs = searcher.search(query, Integer.MAX_VALUE);

* 13 cases of LuceneIndex.java line 382:
  String path = reader.document(doc.doc, PATH_SELECTOR).get(PATH);

So, running the Lucene query and getting the paths is slow, but opening
the Lucene index is even slower in this test case.

Regards,
Thomas
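
The counts above can be turned into relative shares with a few lines of
code. This is only a sketch: ThreadDumpTally and its methods are
hypothetical names, and the percentages are relative to the 315 samples
listed in this mail, not to all threads in the dumps.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Converts sampled stack-frame counts into percentage shares. */
class ThreadDumpTally {

    /** The three counts reported in the mail, keyed by the call seen on top of the stack. */
    static Map<String, Integer> reportedCounts() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("DirectoryReader.open", 259); // LuceneIndex.java line 365
        counts.put("searcher.search", 43);       // LuceneIndex.java line 379
        counts.put("reader.document", 13);       // LuceneIndex.java line 382
        return counts;
    }

    /** Returns each count as a percentage of the sum of all counts. */
    static Map<String, Double> shares(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            result.put(e.getKey(), 100.0 * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Double> e : shares(reportedCounts()).entrySet()) {
            System.out.println(String.format(Locale.ROOT, "%-22s %5.1f%%", e.getKey(), e.getValue()));
        }
    }
}
```

With these numbers, opening the index accounts for roughly four out of
five of the sampled frames.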


Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Chetan Mehrotra <ch...@gmail.com>.

Current update:

1. Tommaso provided a patch (OAK-1702) to disable compression, and that
also helps quite a bit.
2. Currently we are storing the full tokenized text in the Lucene index
[1]. This makes fetching of doc fields slower. On
disabling the storage the numbers improve quite a bit. This storage was added
as part of OAK-319 for supporting MLT.

# FullTextSearchTest               C     min     10%     50%     90%     max       N
Oak-Tar (codec)                    1       9       9      10      12      41    5664
Oak-Tar (codec,mlt off)            1       7       8       8      10      21    6921

Will look into this further.

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/FieldFactory.java#L44


Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

I'm not sure if Chetan's test case matches real world usage if
Collections.sort takes up 23% of the time... I have not seen
Collections.sort in other profiling results at all (so I guess it was less
than 1%). Also, I have seen opening the Lucene index take much more time
in other tests than it does in Chetan's test case.

Regards,
Thomas


Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Alex Parvulescu <al...@gmail.com>.

Aside from the compression issue, there was another one related to the
'order by' clause. I saw Collections.sort taking up as far as 23% of the
perf.

I removed the order by temporarily so it doesn't get in the way of the
Lucene stuff, but I think the QueryEngine should skip ordering results in
this case.





Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Tommaso Teofili <to...@gmail.com>.

I'm looking into the Lucene codecs right now.

Tommaso



Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Alex Parvulescu <al...@gmail.com>.

> Profiling the result shows that quite a bit of time goes in
> org.apache.lucene.codecs.compressing.LZ4.decompress() (40%). This I
> think is part of Lucene 4.x and not present in 3.x. Any idea if I can
> disable compression?

+1 I noticed that too, we should try to disable compression and compare
results.

alex



Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Chetan Mehrotra <ch...@gmail.com>.

On Wed, Apr 9, 2014 at 5:14 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Is that a common use case? To better simulate a normal usage scenario
> I'd make the benchmark fetch up to N results (where N is configurable,
> with default something like 20) and access the path and the title
> property of the matching nodes.

I changed the logic of benchmark in http://svn.apache.org/r1585962.
With that JR2 slows down a bit

# FullTextSearchTest               C     min     10%     50%     90%     max       N
Oak-Tar                            1      34      35      36      39      60    1639
Jackrabbit                         1       5       5       6       7      68   10038

Profiling the result shows that quite a bit of time goes in
org.apache.lucene.codecs.compressing.LZ4.decompress() (40%). This I
think is part of Lucene 4.x and not present in 3.x. Any idea if I can
disable compression?

Chetan Mehrotra

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Thomas Mueller <mu...@adobe.com>.

>
>also, I wonder if we shouldn't also profile the stack of underlying calls
>in the QueryEngine to measure how much time is spent there and how much
>time is spent in the specific QueryIndex implementation.

Analyzing full thread dumps will give you the statistical distribution,
which is quite accurate if you have enough data. In the full thread dumps
I saw so far, I didn't see a thread running within the query engine
itself. All (~300) threads were in the LuceneIndex for this case. So I
expect the query engine part is negligible (less than 1%).

Regards,
Thomas

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Tommaso Teofili <to...@gmail.com>.

2014-04-09 13:44 GMT+02:00 Jukka Zitting <ju...@gmail.com>:

> Hi,
>
> On Wed, Apr 9, 2014 at 7:24 AM, Chetan Mehrotra
> <ch...@gmail.com> wrote:
> > ... the testcase only fetches the first result.
>
> Is that a common use case? To better simulate a normal usage scenario
> I'd make the benchmark fetch up to N results (where N is configurable,
> with default something like 20) and access the path and the title
> property of the matching nodes.
>

+1

also, I wonder if we shouldn't also profile the stack of underlying calls
in the QueryEngine to measure how much time is spent there and how much
time is spent in the specific QueryIndex implementation.

Regards,
Tommaso



>
> BR,
>
> Jukka Zitting
>

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Apr 9, 2014 at 7:24 AM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> ... the testcase only fetches the first result.

Is that a common use case? To better simulate a normal usage scenario
I'd make the benchmark fetch up to N results (where N is configurable,
with default something like 20) and access the path and the title
property of the matching nodes.

BR,

Jukka Zitting
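
The access pattern suggested above could look roughly like the following
sketch. It is plain Java rather than the actual benchmark code: Hit,
readFirst and DEFAULT_LIMIT are hypothetical names, and a Hit with a path
and a title merely stands in for a matching node.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Reads at most a configurable number of results and touches path and title of each. */
class FetchLimitSketch {

    /** Default taken from the suggestion in the mail ("something like 20"). */
    static final int DEFAULT_LIMIT = 20;

    /** Hypothetical stand-in for a matching node. */
    static final class Hit {
        final String path;
        final String title;

        Hit(String path, String title) {
            this.path = path;
            this.title = title;
        }
    }

    /** Consumes up to limit hits from the iterator and returns what was read. */
    static List<String> readFirst(Iterator<Hit> results, int limit) {
        List<String> seen = new ArrayList<>();
        while (seen.size() < limit && results.hasNext()) {
            Hit hit = results.next();
            seen.add(hit.path + " : " + hit.title); // access both path and title
        }
        return seen;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            hits.add(new Hit("/content/page" + i, "Title " + i));
        }
        System.out.println(readFirst(hits.iterator(), DEFAULT_LIMIT).size() + " results read");
    }
}
```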

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Chetan Mehrotra <ch...@gmail.com>.

On Wed, Apr 9, 2014 at 3:00 PM, Alex Parvulescu
<al...@gmail.com> wrote:
>  - the patch assumes that there is and will be a single lucene index
> directly under the root node, which may not necessarily be the case. I
> agree this assumption holds now, but I would not introduce any changes that
> take away this flexibility.

That is not a problem per se, as the IndexReader starts with a ref count of 1,
so it would never go to zero.

The problem appears to be somewhere else: I modified the code to
use a shared IndexSearcher and a native FSDirectory, and the
performance improvement was still marginal.

The problem occurs because
org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndex#query [1]
currently does an eager initialization of the cursor, while the testcase
only fetches the first result. Compared to this, the JR2 version does a
lazy evaluation. If I put a break in the loop (exit after the first result) the
results are much better:

# FullTextSearchTest               C     min     10%     50%     90%     max       N
Oak-Tar(break.shared searcher,fs)  1       2       2       3       3     170   23204
Oak-Tar(break)                     1       5       5       5       6      90   10593
Jackrabbit                         1       4       4       5       6     231   11385

Now I am not sure if this is a problem with the use case taken, or whether the
Lucene index cursor management should be improved, as in many cases there
would be multiple results but the client code only makes use of
the first few.

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-lucene/src/main/java/org/apache/jackrabbit/oak/plugins/index/lucene/LuceneIndex.java#L381-L409
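
The difference between eager and lazy cursor initialization described
above can be sketched in plain Java. This is not the LuceneIndex code:
eagerCursor, lazyCursor and the loadPath callback are hypothetical, with
loadPath standing in for the per-document work of fetching a path from
the index.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntFunction;

/** Contrasts an eagerly filled cursor with one that loads hits on demand. */
class CursorSketch {

    /** Eager: loads the path of every hit before the caller sees the first one. */
    static Iterator<String> eagerCursor(int hitCount, IntFunction<String> loadPath) {
        List<String> paths = new ArrayList<>(hitCount);
        for (int doc = 0; doc < hitCount; doc++) {
            paths.add(loadPath.apply(doc));
        }
        return paths.iterator();
    }

    /** Lazy: loads the path of a hit only when the caller asks for it. */
    static Iterator<String> lazyCursor(int hitCount, IntFunction<String> loadPath) {
        return new Iterator<String>() {
            private int next = 0;

            @Override
            public boolean hasNext() {
                return next < hitCount;
            }

            @Override
            public String next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return loadPath.apply(next++);
            }
        };
    }

    /** Returns how many documents get loaded when the caller reads only the first result. */
    static int loadsForFirstResult(boolean lazy, int hitCount) {
        AtomicInteger loads = new AtomicInteger();
        IntFunction<String> loader = doc -> {
            loads.incrementAndGet();
            return "/content/doc" + doc;
        };
        Iterator<String> cursor = lazy ? lazyCursor(hitCount, loader) : eagerCursor(hitCount, loader);
        if (cursor.hasNext()) {
            cursor.next();
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("eager loads: " + loadsForFirstResult(false, 1000));
        System.out.println("lazy loads:  " + loadsForFirstResult(true, 1000));
    }
}
```

A lazy cursor only pays off when callers stop early; a caller that reads
every result does the same amount of loading either way.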

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Alex Parvulescu <al...@gmail.com>.

Hi,

I agree with the idea to find a way to share the readers across threads.

Looking at the proposed patch I see a few problems:

 - the patch assumes that there is and will be a single lucene index
directly under the root node, which may not necessarily be the case. I
agree this assumption holds now, but I would not introduce any changes that
take away this flexibility.

 - browsing through I notice that this only helps with concurrent threads,
the call searcherManager.release translates into a decRef which means the
readers will be closed if I'm not mistaken.
This might explain the only marginal gain in perf.
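
The reference counting behind the decRef question can be sketched as
follows. This is a toy model, not Lucene's IndexReader: CountedReader is a
hypothetical class, and the assumption it encodes is that the owner holds
one reference from creation, so a balanced acquire/release per search does
not bring the count to zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy reference-counted resource: it is closed only when the count reaches zero. */
class RefCountSketch {

    static final class CountedReader {
        // Starts at 1: this initial reference belongs to the owner (e.g. a manager).
        private final AtomicInteger refCount = new AtomicInteger(1);

        /** Takes a reference unless the reader is already closed. */
        boolean tryIncRef() {
            int count;
            while ((count = refCount.get()) > 0) {
                if (refCount.compareAndSet(count, count + 1)) {
                    return true;
                }
            }
            return false;
        }

        /** Gives a reference back; the reader counts as closed once none are left. */
        void decRef() {
            if (refCount.decrementAndGet() < 0) {
                throw new IllegalStateException("decRef called more often than incRef");
            }
        }

        boolean isClosed() {
            return refCount.get() <= 0;
        }

        int refCount() {
            return refCount.get();
        }
    }

    /** Runs balanced acquire/release pairs and reports whether the reader got closed. */
    static boolean closedAfterSearches(CountedReader reader, int searches) {
        for (int i = 0; i < searches; i++) {
            if (!reader.tryIncRef()) {
                throw new IllegalStateException("reader already closed");
            }
            try {
                // a search would run here
            } finally {
                reader.decRef();
            }
        }
        return reader.isClosed();
    }

    public static void main(String[] args) {
        CountedReader reader = new CountedReader();
        System.out.println("closed after searches: " + closedAfterSearches(reader, 50));
        reader.decRef(); // the owner drops its own reference
        System.out.println("closed after owner release: " + reader.isClosed());
    }
}
```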

We should be looking for a more general optimization where we might
leverage the fact that the index is only updated every 5 seconds.
I was thinking that we can use the initial NodeState from the index content
node as a way to tell if it changed or not (using equals calls).
It would work in the following way: on the first call, with no state in the
searcherManager, take the provided NodeState (again an arbitrary node state, as the
index could be on any node of the repo), build an index reader based on
it, and reuse it from as many threads as you need. Cache this under
path/NodeState/IndexReader. On each subsequent call we can use the provided
NodeState to check if the cache is stale or not: path + NodeState.equals.

The biggest problem I see here is resource cleanup, as we'll not call
decRef on each search call, we need a way to get notified when the
application shuts down. Similar to Chetan's patch we can use a combo of
'Closeable' and '@Deactivate' but I'm not sure that will be enough outside
OSGi.
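
A minimal sketch of the caching idea and its cleanup problem, under stated
assumptions: ReaderCache and FakeReader are hypothetical, any object with
a meaningful equals() stands in for the index NodeState, and a stale
reader is closed immediately, which a real implementation could not do
while other threads are still searching on it.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

/** Caches one reader per index path and replaces it when the index state changes. */
class ReaderCacheSketch {

    /** Hypothetical stand-in for an open index reader built from a given state. */
    static final class FakeReader implements Closeable {
        final Object state;
        boolean closed;

        FakeReader(Object state) {
            this.state = state;
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    static final class ReaderCache implements Closeable {
        private final Map<String, FakeReader> readers = new HashMap<>();
        private int opens;

        /** Returns the cached reader if the state is unchanged, otherwise opens a new one. */
        synchronized FakeReader get(String indexPath, Object indexState) {
            FakeReader cached = readers.get(indexPath);
            if (cached != null && cached.state.equals(indexState)) {
                return cached; // cache hit: path + state.equals
            }
            if (cached != null) {
                cached.close(); // simplification: assumes nobody still uses the stale reader
            }
            FakeReader fresh = new FakeReader(indexState);
            opens++;
            readers.put(indexPath, fresh);
            return fresh;
        }

        synchronized int opens() {
            return opens;
        }

        /** To be called on shutdown, e.g. from a deactivate hook. */
        @Override
        public synchronized void close() {
            for (FakeReader reader : readers.values()) {
                reader.close();
            }
            readers.clear();
        }
    }

    public static void main(String[] args) {
        ReaderCache cache = new ReaderCache();
        cache.get("/oak:index/lucene", "rev-1");
        cache.get("/oak:index/lucene", "rev-1"); // reused
        cache.get("/oak:index/lucene", "rev-2"); // state changed, reopened
        System.out.println("opens: " + cache.opens());
        cache.close();
    }
}
```

The index path "/oak:index/lucene" and the "rev-N" strings are only
placeholder values for the demo.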

Take this with a grain of salt, I probably missed some aspects of the
problem along the way.

best,
alex


Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Chetan Mehrotra <ch...@gmail.com>.

On Wed, Apr 9, 2014 at 12:25 PM, Marcel Reutegger <mr...@adobe.com> wrote:
>> Since the Lucene index is in any case updated asynchronously, it
>> should be fine for us to ignore the base NodeState of the current
>> session and instead use an IndexSearcher based on the last state as
>> updated by the async indexer. This would allow us to reuse the
>> IndexSearcher over multiple queries.
>
> I was also wondering if it makes sense to share it across multiple
> sessions performing a query to reduce the number of index readers
> that may be open at the same time. however, this will likely also
> reduce concurrency because we synchronize access to a single
> session.

I tried one approach [1] where I used a custom SearcherManager based
on Lucene's SearcherManager [2]. It obtains the root NodeState directly from
the NodeStore. As the NodeStore can be accessed concurrently, it should not
have any impact on session concurrency.

With this change there is a slight improvement:

# FullTextSearchTest               C     min     10%     50%     90%     max       N
Oak-Tar                            1      39      40      40      44      64    1459
Oak-Tar(Shared)                    1      32      33      34      36      61    1738

So it did not give much of a boost (at least with the approach taken). As I do not
have much understanding of Lucene internals, can someone review the
approach taken and see if there are any major issues with it?


Chetan Mehrotra
[1] https://issues.apache.org/jira/secure/attachment/12639366/OAK-1702-shared-indexer.patch
[2] https://lucene.apache.org/core/3_6_0/api/all/org/apache/lucene/search/SearcherManager.html

RE: Slow full text query performance and Lucene Index handling in Oak

Posted by Marcel Reutegger <mr...@adobe.com>.

Hi,

> Since the Lucene index is in any case updated asynchronously, it
> should be fine for us to ignore the base NodeState of the current
> session and instead use an IndexSearcher based on the last state as
> updated by the async indexer. This would allow us to reuse the
> IndexSearcher over multiple queries.

I was also wondering if it makes sense to share it across multiple
sessions performing a query to reduce the number of index readers
that may be open at the same time. however, this will likely also
reduce concurrency because we synchronize access to a single
session.

we should also try to re-open the existing reader, which is less
costly than creating a new reader. I'm not familiar anymore with
the most recent lucene version, but with the version used in
Jackrabbit 2.x this was possible and helped a lot.

Regards
 Marcel
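
Why re-opening can be cheaper than opening from scratch can be shown with
a toy model. It assumes that a reader is made of immutable segments and
that a re-open only has to load segments the old reader does not already
hold. SnapshotReader, loadSegment and reopen are hypothetical names, not
Lucene API; in Lucene 4.x the corresponding call should be
DirectoryReader.openIfChanged, which is not used here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of re-opening a reader by reusing already loaded segments. */
class ReopenSketch {

    /** Counts how many segments had to be loaded; models the expensive part of an open. */
    static int segmentLoads;

    /** Hypothetical reader: an immutable view over a set of loaded segments. */
    static final class SnapshotReader {
        final Map<String, String> segments; // segment name -> loaded data

        SnapshotReader(Map<String, String> segments) {
            this.segments = segments;
        }
    }

    static String loadSegment(String name) {
        segmentLoads++;
        return "data:" + name;
    }

    /** Opens a reader from scratch, loading every segment. */
    static SnapshotReader open(List<String> segmentNames) {
        return reopen(null, segmentNames);
    }

    /** Builds a reader for the current segments, reusing what the old reader already loaded. */
    static SnapshotReader reopen(SnapshotReader old, List<String> segmentNames) {
        Map<String, String> loaded = new LinkedHashMap<>();
        for (String name : segmentNames) {
            String data = (old == null) ? null : old.segments.get(name);
            loaded.put(name, data != null ? data : loadSegment(name));
        }
        return new SnapshotReader(loaded);
    }

    public static void main(String[] args) {
        SnapshotReader first = open(Arrays.asList("_0", "_1", "_2"));
        int afterOpen = segmentLoads;
        reopen(first, Arrays.asList("_0", "_1", "_2", "_3"));
        System.out.println("loads for open: " + afterOpen
                + ", extra loads for reopen: " + (segmentLoads - afterOpen));
    }
}
```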

Re: Slow full text query performance and Lucene Index handling in Oak

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Tue, Apr 8, 2014 at 11:51 AM, Chetan Mehrotra
<ch...@gmail.com> wrote:
> 1. Multiple IndexSearcher instances - Current impl would create a new
> IndexSearcher for every Lucene query as the OakDirectory uses is bound
> to NodeState of executing JCR session.

Since the Lucene index is in any case updated asynchronously, it
should be fine for us to ignore the base NodeState of the current
session and instead use an IndexSearcher based on the last state as
updated by the async indexer. This would allow us to reuse the
IndexSearcher over multiple queries.

> 2. Index Access - Currently we have custom OakDirectory which provides
> access to Lucene indexes stored in NodeStore. Even with SegmentStore
> which has memory mapped file the random access used by Lucene would
> probably be lot slower with OakDirectory in comparison to default
> Lucene MMapDirectory.

There's of course some extra overhead in going through Oak's Blob
interface, but I would be surprised if this turned out to be
significant and impossible to optimize as the frequently accessed
parts of the index would in either case be cached in memory. So I'd go
with approach 1 first and see where we are then before jumping to
conclusions on this one.

BR,

Jukka Zitting