
Posted to dev@jackrabbit.apache.org by Christoph Kiehl <ch...@sulu3000.de> on 2007/06/08 12:50:15 UTC

Optimize search performance

Hi everyone,

I had a look at the search-related code over the last few days, because we need 
better performance for range queries on date fields as well as for sorting by 
date fields. These are my thoughts so far:

1. Wouldn't it make sense to exclude the index for the "jcr:system" tree (which 
is located at repository/index by default) if the query to execute doesn't 
include items from the "jcr:system" tree?
Take for example a query like "my:app//element(*, foo:bar)". This query only 
searches for nodes located under "my:app" which excludes nodes from "jcr:system" 
and therefore doesn't need to search in the "jcr:system" index.
As the "jcr:system" tree might grow quite quickly if you create a lot of 
versions, it might be worth excluding it.
I'm not sure though how hard it would be to find out if a query needs to include 
the "jcr:system" index.

2. Lucene uses the FieldCaches to speed up sorting and range queries which is 
exactly what we are after. Those FieldCaches are per IndexReader.
Jackrabbit uses an IndexSearcher which searches on a single IndexReader which is 
most likely to be an instance of CachingMultiReader. So on every search which 
builds up a FieldCache this FieldCache instance is associated with this instance 
of a CachingMultiReader. On successive queries which operate on this 
CachingMultiReader you will get a tremendous speedup for queries which can reuse 
those associated FieldCache instances.
The problem is that Jackrabbit creates a new CachingMultiReader _every time_ one 
of the underlying indexes is modified. This means if you just change _one_ item 
in the repository you will need to rebuild all those FieldCaches because the 
existing FieldCaches are associated with the old instance of CachingMultiReader.
This not only leads to slow search response times for queries which contain 
range queries or are sorted by a field but also leads to massive memory 
consumption (depending on the size of your indexes) because there might be 
multiple instances of CachingMultiReaders in use if you have a scenario where a 
lot of queries and item modifications are executed concurrently.
As far as I understand, the solution is to use a MultiSearcher which uses 
multiple IndexReaders. Since most of the indexes are stable due to the merging 
strategy, the FieldCaches could then be used for a much longer time.
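
To make the cost difference concrete, here is a minimal sketch in plain Java. It is not Lucene or Jackrabbit code: the class FieldCacheSketch, its methods and the idea of counting "segments scanned" are assumptions made purely for illustration. It models a cache keyed by reader identity and compares one composite reader that is replaced on every change with per-segment readers where only the changed segment is reopened.

```java
import java.util.IdentityHashMap;
import java.util.Map;

/** Hypothetical model of a cache keyed by reader instance; not Lucene's FieldCache. */
final class FieldCacheSketch {
    private final Map<Object, Boolean> cache = new IdentityHashMap<>();
    private int segmentsScanned = 0;

    /** On a cache miss, "builds" the cache by scanning every segment behind the reader. */
    void lookup(Object reader, int segmentsBehindReader) {
        if (cache.putIfAbsent(reader, Boolean.TRUE) == null) {
            segmentsScanned += segmentsBehindReader;
        }
    }

    /** One composite reader over all segments, replaced after every modification. */
    static int compositeReaderCost(int stableSegments, int modifications) {
        FieldCacheSketch fc = new FieldCacheSketch();
        for (int i = 0; i <= modifications; i++) {
            Object compositeReader = new Object(); // new instance per index state
            fc.lookup(compositeReader, stableSegments + 1);
        }
        return fc.segmentsScanned;
    }

    /** One cache per segment; only the volatile segment gets a new reader. */
    static int perSegmentCost(int stableSegments, int modifications) {
        FieldCacheSketch fc = new FieldCacheSketch();
        Object[] stable = new Object[stableSegments];
        for (int i = 0; i < stableSegments; i++) {
            stable[i] = new Object();
        }
        for (int i = 0; i <= modifications; i++) {
            for (Object segmentReader : stable) {
                fc.lookup(segmentReader, 1); // a hit after the first query
            }
            fc.lookup(new Object(), 1);      // the changed segment is reopened
        }
        return fc.segmentsScanned;
    }

    public static void main(String[] args) {
        System.out.println(compositeReaderCost(3, 10)); // 11 queries * 4 segments = 44
        System.out.println(perSegmentCost(3, 10));      // 4 for the first query + 10 = 14
    }
}
```

In this toy model one query runs after each of the 10 modifications, plus one initial query. The composite variant rescans all four segments every time, while the per-segment variant only rescans the segment that changed.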

I just tried to quickly modify SearchIndex to use a MultiSearcher with multiple 
IndexReaders wrapped by IndexSearchers but wasn't successful because somewhere 
in DescendantSelfAxisWeight the index readers are required to implement 
HierarchyResolver which ReadOnlyIndexReader doesn't.

So I thought I would ask what you think about these two ideas before spending 
too much time walking down the wrong path ;)

Cheers,
Christoph

Re: Optimize search performance

Posted by Marcel Reutegger <ma...@gmx.net>.

Christoph Kiehl wrote:
> I just created an issue to which I attached an initial patch which 
> works quite well for us. It doesn't use MultiSearcher but extends 
> SharedFieldSortComparator to be aware of the underlying index segments. 
> Could you please review the patch?

will do. thanks a lot for your effort.

meanwhile I also created a patch which works a bit differently from yours. I will 
attach it to the issue. I would be interested to see how it works in your 
environment.

regards
  marcel

Re: Optimize search performance

Posted by Christoph Kiehl <ch...@sulu3000.de>.

Marcel Reutegger wrote:

>> 2. Lucene uses the FieldCaches to speed up sorting and range queries 
>> which is exactly what we are after. Those FieldCaches are per 
>> IndexReader.
>> Jackrabbit uses an IndexSearcher which searches on a single 
>> IndexReader which is most likely to be an instance of 
>> CachingMultiReader. So on every search which builds up a FieldCache 
>> this FieldCache instance is associated with this instance of a 
>> CachingMultiReader. On successive queries which operate on this 
>> CachingMultiReader you will get a tremendous speedup for queries which 
>> can reuse  those associated FieldCache instances.
>> The problem is that Jackrabbit creates a new CachingMultiReader 
>> _everytime_ one of the underlying indexes are modified. This means if 
>> you just change _one_ item in the repository you will need to rebuild 
>> all those FieldCaches because the existing FieldCaches are associated 
>> with the old instance of CachingMultiReader.
>> This does not only lead to slow search response times for queries 
>> which contains range queries or are sorted by a field but also leads 
>> to massive memory consumption (depending on the size of your indexes) 
>> because there might be multiple instances of CachingMultiReaders in 
>> use if you have a scenario where a lot of queries and item 
>> modifications are executed concurrently.
>> As far as I understand the solution is to use a MultiSearcher which 
>> uses multiple IndexReaders. Since due to the merging strategy most of 
>> the indexes are stable this means the FieldCaches can be used for a 
>> much longer time.
> 
> Using a multi searcher means that you must be able to execute a query on 
> each of the index segments independently. this is not possible because 
> hierarchy information is always spread across multiple segments. e.g. a 
> node in one segment may reference a parent in another segment.

I just created an issue [1] to which I attached an initial patch which works 
quite well for us. It doesn't use MultiSearcher but extends 
SharedFieldSortComparator to be aware of the underlying index segments. Could 
you please review the patch?

Cheers,
Christoph

[1] http://issues.apache.org/jira/browse/JCR-974

Re: Optimize search performance

Posted by Christoph Kiehl <ch...@sulu3000.de>.

Marcel Reutegger wrote:

>> 1. Wouldn't it make sense to exclude the index for the "jcr:system" 
>> tree (which is located at repository/index by default) if the query to 
>> execute doesn't include items from the "jcr:system" tree.
>> Take for example a query like "my:app//element(*, foo:bar)". This 
>> query only searches for nodes located under "my:app" which excludes 
>> nodes from "jcr:system" and therefore doesn't need to search in the 
>> "jcr:system" index.
> 
> I think this is doable. Can you please file a jira issue about this?

I created JCR-967 and included your comments.

I'll check the code for further discussion of point 2.

Cheers,
Christoph

Re: Optimize search performance

Posted by Marcel Reutegger <ma...@day.com>.

Hi Christoph,

Christoph Kiehl wrote:
> I had a look at the search related code during the last days, because we 
> need better performance for range queries on date fields as well as for 
> sorting by date fields. These are my thoughts so far:
> 
> 1. Wouldn't it make sense to exclude the index for the "jcr:system" tree 
> (which is located at repository/index by default) if the query to 
> execute doesn't include items from the "jcr:system" tree.
> Take for example a query like "my:app//element(*, foo:bar)". This query 
> only searches for nodes located under "my:app" which excludes nodes from 
> "jcr:system" and therefore doesn't need to search in the "jcr:system" 
> index.

I think this is doable. Can you please file a jira issue about this?

> As the "jcr:system" might grow quite quickly if you create a lot 
> versions it might be worth to exclude it.
> I'm not sure though how hard it would be to find out if a query needs to 
> include the "jcr:system" index.

There are two relevant nodes in the query tree to find that out.

- what's the first location step and does it include the jcr:system tree? I 
think that's an easy one.
- does the query contain a jcr:deref node? If it does, an intermediate result of 
the query may dereference into the jcr:system tree.
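
As a rough illustration of these two checks, here is a minimal sketch in plain Java. It does not use Jackrabbit's query tree classes: SystemIndexCheck and its parameters are hypothetical, and the query is assumed to be already reduced to the name and axis of its first location step (relative to the root node) plus a flag for jcr:deref.

```java
/**
 * Hypothetical helper, not part of Jackrabbit: decides whether a query has to
 * search the "jcr:system" index, based on the first location step and jcr:deref.
 */
final class SystemIndexCheck {

    static boolean needsSystemIndex(String firstStepName,
                                    boolean firstStepIsDescendantAxis,
                                    boolean containsDeref) {
        // A jcr:deref may follow a reference into the jcr:system tree.
        if (containsDeref) {
            return true;
        }
        // A descendant axis from the root, or an unnamed/wildcard first step,
        // can match nodes below jcr:system.
        if (firstStepIsDescendantAxis
                || firstStepName == null
                || "*".equals(firstStepName)) {
            return true;
        }
        // Otherwise only a first step that names jcr:system itself needs the index.
        return "jcr:system".equals(firstStepName);
    }

    public static void main(String[] args) {
        // "my:app//element(*, foo:bar)": first step is my:app, no jcr:deref
        System.out.println(needsSystemIndex("my:app", false, false));
    }
}
```

The check is deliberately conservative: whenever it cannot rule out the jcr:system tree, it reports that the index is needed.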

> 2. Lucene uses the FieldCaches to speed up sorting and range queries 
> which is exactly what we are after. Those FieldCaches are per IndexReader.
> Jackrabbit uses an IndexSearcher which searches on a single IndexReader 
> which is most likely to be an instance of CachingMultiReader. So on 
> every search which builds up a FieldCache this FieldCache instance is 
> associated with this instance of a CachingMultiReader. On successive 
> queries which operate on this CachingMultiReader you will get a 
> tremendous speedup for queries which can reuse  those associated 
> FieldCache instances.
> The problem is that Jackrabbit creates a new CachingMultiReader 
> _everytime_ one of the underlying indexes are modified. This means if 
> you just change _one_ item in the repository you will need to rebuild 
> all those FieldCaches because the existing FieldCaches are associated 
> with the old instance of CachingMultiReader.
> This does not only lead to slow search response times for queries which 
> contains range queries or are sorted by a field but also leads to 
> massive memory consumption (depending on the size of your indexes) 
> because there might be multiple instances of CachingMultiReaders in use 
> if you have a scenario where a lot of queries and item modifications are 
> executed concurrently.
> As far as I understand the solution is to use a MultiSearcher which uses 
> multiple IndexReaders. Since due to the merging strategy most of the 
> indexes are stable this means the FieldCaches can be used for a much 
> longer time.

this is all correct, but it does not work, and you actually already found 
out why:

> I just tried to quickly modify SearchIndex to use a MultiSearcher with 
> multiple IndexReaders wrapped by IndexSearchers but wasn't successful 
> because somewhere in DescendantSelfAxisWeight the index readers are 
> required to implement HierarchyResolver which ReadOnlyIndexReader doesn't.

Using a multi searcher means that you must be able to execute a query on each of 
the index segments independently. this is not possible because hierarchy 
information is always spread across multiple segments. e.g. a node in one 
segment may reference a parent in another segment.
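
A small sketch of the problem, in plain Java rather than Jackrabbit code. The class HierarchySketch and the representation of a segment as a map from node id to parent id are assumptions made only for illustration. A descendant check that sees just one segment gives up as soon as the parent chain leaves that segment.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical model: each segment maps a node id to the id of its parent. */
final class HierarchySketch {

    /** Walks up the parent chain using only the segments that are visible. */
    static boolean isDescendant(String node, String ancestor,
                                List<Map<String, String>> visibleSegments) {
        String current = node;
        while (current != null) {
            String parent = null;
            for (Map<String, String> segment : visibleSegments) {
                if (segment.containsKey(current)) {
                    parent = segment.get(current);
                    break;
                }
            }
            if (parent == null) {
                return false; // chain ends, or continues in a segment we cannot see
            }
            if (parent.equals(ancestor)) {
                return true;
            }
            current = parent;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> segmentA = Map.of("app", "root");
        Map<String, String> segmentB = Map.of("doc", "app");
        // Only segment B visible: the parent "app" lives in segment A.
        System.out.println(isDescendant("doc", "root", List.of(segmentB)));
        // Both segments visible: the chain doc -> app -> root can be followed.
        System.out.println(isDescendant("doc", "root", List.of(segmentA, segmentB)));
    }
}
```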

there's also another reason why a multi searcher is not the best solution. it 
requires that the fields of a returned FieldDoc contain the values of the 
indexed property. If there are lots of values to order, the complete set of 
values needs to be read into memory. With the current implementation this is not 
needed because there is just a single FieldCache that uses integers instead of 
the real value. See class SharedFieldSortComparator [1]. the downside of this 
approach is that you cannot do a merge sort just using those integers.

a viable solution may be a combination of both approaches. use a FieldCache 
per index segment (which allows us to cache them for a longer period) but still 
use integer values for ordering of nodes within a segment. Then do a merge sort 
with a modified SharedFieldSortComparator that reads property values from the 
item state manager when nodes are compared across index segments. even though 
this requires reading property state, the performance shouldn't suffer too much, 
I think. the properties would be read anyway when the query result is iterated, 
so it shouldn't harm if they are read already during query execution.
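
The combined approach could look roughly like the sketch below. It is plain Java and not the actual SharedFieldSortComparator: SegmentMergeSketch, Hit and the loadValue function (standing in for a property read through the item state manager) are hypothetical names. Hits are ordered by integer within a segment, and real values are only read when hits from different segments are compared during the merge.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Function;

/** Hypothetical sketch of a merge sort over per-segment results. */
final class SegmentMergeSketch {

    /** A hit with its node id and its sort ordinal, valid only inside its segment. */
    static final class Hit {
        final String nodeId;
        final int ordinal;

        Hit(String nodeId, int ordinal) {
            this.nodeId = nodeId;
            this.ordinal = ordinal;
        }
    }

    static List<String> merge(List<List<Hit>> segments, Function<String, Long> loadValue) {
        // Step 1: order each segment using only the cheap per-segment integers.
        List<List<Hit>> sorted = new ArrayList<>();
        for (List<Hit> segment : segments) {
            List<Hit> copy = new ArrayList<>(segment);
            copy.sort(Comparator.comparingInt((Hit h) -> h.ordinal));
            sorted.add(copy);
        }
        // Step 2: k-way merge. A cursor is {segment index, position}; cursors in
        // the heap always belong to different segments, so they are compared by
        // the real property value.
        PriorityQueue<int[]> heap = new PriorityQueue<>((x, y) -> Long.compare(
                loadValue.apply(sorted.get(x[0]).get(x[1]).nodeId),
                loadValue.apply(sorted.get(y[0]).get(y[1]).nodeId)));
        for (int s = 0; s < sorted.size(); s++) {
            if (!sorted.get(s).isEmpty()) {
                heap.add(new int[] {s, 0});
            }
        }
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<Hit> segment = sorted.get(cursor[0]);
            result.add(segment.get(cursor[1]).nodeId);
            if (cursor[1] + 1 < segment.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return result;
    }
}
```

The sketch assumes that ordinals within a segment agree with the order of the real values. It also reloads a value on every comparison, so a real implementation would probably want to cache loaded values per query.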

regards
  marcel

[1] https://svn.apache.org/repos/asf/jackrabbit/tags/1.3/jackrabbit-core/src/main/java/org/apache/jackrabbit/core/query/lucene/SharedFieldSortComparator.java