Posted to dev@jackrabbit.apache.org by "Christoph Kiehl (JIRA)" <ji...@apache.org> on 2007/06/19 18:02:26 UTC

[jira] Created: (JCR-974) Manage Lucene FieldCaches per index segment

Manage Lucene FieldCaches per index segment
-------------------------------------------

                 Key: JCR-974
                 URL: https://issues.apache.org/jira/browse/JCR-974
             Project: Jackrabbit
          Issue Type: Improvement
          Components: query
    Affects Versions: 1.3
            Reporter: Christoph Kiehl


Jackrabbit uses an IndexSearcher that searches on a single IndexReader, which is most likely an instance of CachingMultiReader. On every search that involves sorting or range queries, a FieldCache is populated and associated with this CachingMultiReader instance. Successive queries that operate on the same CachingMultiReader get a tremendous speedup when they can reuse the associated FieldCache instances.
The problem is that Jackrabbit creates a new CachingMultiReader _every time_ one of the underlying indexes is modified. This means that if you change just _one_ item in the repository, all those FieldCaches have to be rebuilt, because the existing FieldCaches are associated with the old CachingMultiReader instance.
This not only leads to slow response times for queries that contain range queries or are sorted by a field, but also to massive memory consumption (depending on the size of your indexes), because multiple CachingMultiReader instances might be in use at once when many queries and item modifications are executed concurrently.
The goal is to keep those FieldCaches for as long as possible.
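
The effect described above can be illustrated with a minimal sketch. It does not use Lucene or Jackrabbit classes: SegmentReaderStub and PerSegmentCacheSketch are hypothetical stand-ins, and the WeakHashMap keyed by the segment reader is an assumption about how such a cache could be held. The point is only that when the cache key is the individual segment reader rather than the composite reader, replacing one small segment forces one reload instead of reloading everything.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for the reader of one index segment; not a Lucene class. */
final class SegmentReaderStub {
    final String[] values; // one field value per document in this segment

    SegmentReaderStub(String... values) {
        this.values = values;
    }
}

/** Hypothetical cache keyed by segment reader instead of by the composite reader. */
final class PerSegmentCacheSketch {
    // Weak keys: entries of segments that are no longer referenced can be collected.
    private final Map<SegmentReaderStub, String[]> cache = new WeakHashMap<>();
    int loads = 0; // how often a segment had to be loaded

    synchronized String[] get(SegmentReaderStub segment) {
        String[] cached = cache.get(segment);
        if (cached == null) {
            cached = segment.values.clone(); // stands in for the expensive term enumeration
            cache.put(segment, cached);
            loads++;
        }
        return cached;
    }

    /** Touches every segment of a composite view and returns the total number of loads so far. */
    int warm(List<SegmentReaderStub> segments) {
        for (SegmentReaderStub s : segments) {
            get(s);
        }
        return loads;
    }

    public static void main(String[] args) {
        SegmentReaderStub big = new SegmentReaderStub("a", "c", "e");
        SegmentReaderStub small = new SegmentReaderStub("b");
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();

        System.out.println("loads after first query: " + cache.warm(Arrays.asList(big, small)));

        // One item changes: only the small segment is replaced, the big one is reused.
        SegmentReaderStub small2 = new SegmentReaderStub("b", "d");
        System.out.println("loads after modification: " + cache.warm(Arrays.asList(big, small2)));
    }
}
```

With a cache keyed by the composite view, the second call would have to reload the big segment as well.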

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-974:
------------------------------

          Component/s: jackrabbit-core
    Affects Version/s:     (was: 1.3)

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: jackrabbit-core, query
>            Reporter: Christoph Kiehl
>             Fix For: 1.4
>
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt, patch3.txt


[jira] Issue Comment Edited: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506232 ] 

Christoph Kiehl edited comment on JCR-974 at 6/19/07 9:13 AM:
--------------------------------------------------------------

This is a first patch which uses a FieldCache per index segment. To make this work we had to use our own implementation of FieldCache.StringIndex, which does not keep an array of sort indexes for the documents but instead keeps an array of the terms associated with each document. This of course uses more memory, and some performance/scaling tests still need to be done.
We had to modify SearchIndex.CombinedIndexReader and CachingMultiReader to allow access to the underlying IndexReaders, because those IndexReaders are used as cache keys in SharedFieldCache.
I'm not entirely satisfied with this solution, because SharedFieldSortComparator has to know that there is a CombinedIndexReader and currently even assumes it.
Performance-wise we achieved a speedup by a factor of 5-15 for queries sorting by some field in our current application. In our scenario we have a lot of write operations and more than 1,000,000 nodes. For read-only repositories this patch slightly degrades performance by a factor of about 2.
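
Why the per-segment cache keeps terms rather than sort indexes can be shown with a small sketch. It is not taken from the patch: SegmentTermsSketch and its methods are hypothetical, and documents without a value for the field are ignored for brevity. A sort index (ordinal) is only meaningful inside the segment it was computed for, so two documents from different segments have to be compared by their actual terms.

```java
import java.util.Arrays;

/** Hypothetical per-segment cache that stores the term itself for each document. */
final class SegmentTermsSketch {
    final String[] termOfDoc; // termOfDoc[localDocId] = field value of that document

    SegmentTermsSketch(String... termOfDoc) {
        this.termOfDoc = termOfDoc;
    }

    /** Position of each document's term among the sorted distinct terms of THIS segment only. */
    int[] localOrdinals() {
        String[] sorted = Arrays.stream(termOfDoc).distinct().sorted().toArray(String[]::new);
        int[] ord = new int[termOfDoc.length];
        for (int i = 0; i < ord.length; i++) {
            ord[i] = Arrays.binarySearch(sorted, termOfDoc[i]);
        }
        return ord;
    }

    /** Compares two documents that may live in different segments by their terms. */
    static int compareByTerm(SegmentTermsSketch a, int docA, SegmentTermsSketch b, int docB) {
        return a.termOfDoc[docA].compareTo(b.termOfDoc[docB]);
    }

    public static void main(String[] args) {
        SegmentTermsSketch seg1 = new SegmentTermsSketch("pear");
        SegmentTermsSketch seg2 = new SegmentTermsSketch("apple", "banana");

        // Local ordinals: "pear" is 0 in seg1, "banana" is 1 in seg2.
        int byOrdinal = Integer.compare(seg1.localOrdinals()[0], seg2.localOrdinals()[1]);
        int byTerm = compareByTerm(seg1, 0, seg2, 1);

        System.out.println("ordinal comparison says pear < banana: " + (byOrdinal < 0));
        System.out.println("term comparison says pear > banana:    " + (byTerm > 0));
    }
}
```

The trade-off is the one named in the comment: holding one term reference per document costs more memory than holding one int per document.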


 was:
This is a first patch which uses a FieldCache per index segment. To make this work we had to use our own implementation of FieldCache.StringIndex, which does not keep an array of sort indexes for the documents but instead keeps an array of the terms associated with each document. This of course uses more memory, and some performance/scaling tests still need to be done.
We had to modify SearchIndex.CombinedIndexReader and CachingMultiReader to allow access to the underlying IndexReaders, because those IndexReaders are used as cache keys in SharedFieldCache.
I'm not entirely satisfied with this solution, because SharedFieldSortComparator has to know that there is a CombinedIndexReader and currently even assumes it.
Performance-wise we achieved a speedup by a factor of 5-15 in our current application, where we have a lot of write operations and more than 1,000,000 nodes. For read-only repositories this patch slightly degrades performance by a factor of about 2.

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: patch.txt


[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507153 ] 

Marcel Reutegger commented on JCR-974:
--------------------------------------

Thanks a lot for the revised patch. I was just going to write some comments about your previous version and now most of my concerns are already addressed ;)

One of the concerns I had was regarding lazy loading, because it would have required synchronization on the map (which was missing in the previous patch).

I'm using a fairly simple test case to measure performance. It involves creating 200'000 nodes with a LONG property set to distinct values and then executing the following query:

stuff//*[@foo] order by @foo

On average the execution time with the current codebase is 2200ms, with your initial patch: 265ms and with the latest patch: 235ms.

Can you please make sure you consistently use space characters instead of tabs in the source code? Thanks.

Now, there's just one thing left. You introduced a cache for the ReadOnlyIndexReaders in AbstractIndex. I'd rather not have a cache there, because it means that we have to maintain it. In your patch the map is cleaned when the index is invalidated. For older index segments (the bigger ones, which resulted from index merges) this means that the map is only cleaned when the index segment is closed (when it is part of a merge or on shutdown). IMO this is somewhat of a memory leak and should be changed.

I would rather have these three lines at the beginning of the method SharedFieldCache.getStringIndex():

        if (reader instanceof ReadOnlyIndexReader) {
            reader = ((ReadOnlyIndexReader) reader).getBase();
        }

It doesn't win a prize for beauty, but it has the same effect as the cache.

WDYT?
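
The unwrapping suggested in this comment can be sketched in isolation. The classes below are hypothetical stand-ins, not the Jackrabbit ones: BaseReaderStub plays the role of the underlying reader, ReadOnlyWrapperStub the role of a short-lived ReadOnlyIndexReader, and the WeakHashMap is an assumed cache layout. Because the cache key is always the base reader, any number of wrappers around the same segment share one entry, and no separate wrapper cache has to be maintained.

```java
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical stand-in for the long-lived reader of one index segment. */
class BaseReaderStub {
}

/** Hypothetical short-lived wrapper around a base reader. */
final class ReadOnlyWrapperStub extends BaseReaderStub {
    private final BaseReaderStub base;

    ReadOnlyWrapperStub(BaseReaderStub base) {
        this.base = base;
    }

    BaseReaderStub getBase() {
        return base;
    }
}

final class UnwrapKeySketch {
    private final Map<BaseReaderStub, String[]> cache = new WeakHashMap<>();
    int loads = 0; // how often the expensive load ran

    synchronized String[] getStringIndex(BaseReaderStub reader) {
        // Unwrap first, so that all wrappers of one segment map to the same cache key.
        if (reader instanceof ReadOnlyWrapperStub) {
            reader = ((ReadOnlyWrapperStub) reader).getBase();
        }
        String[] index = cache.get(reader);
        if (index == null) {
            index = new String[0]; // placeholder for the expensive load
            cache.put(reader, index);
            loads++;
        }
        return index;
    }

    public static void main(String[] args) {
        BaseReaderStub segment = new BaseReaderStub();
        UnwrapKeySketch cache = new UnwrapKeySketch();

        String[] first = cache.getStringIndex(new ReadOnlyWrapperStub(segment));
        String[] second = cache.getStringIndex(new ReadOnlyWrapperStub(segment));

        System.out.println("same cached instance: " + (first == second));
        System.out.println("loads: " + cache.loads);
    }
}
```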

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt


[jira] Resolved: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger resolved JCR-974.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.4

Committed the latest patch in revision: 550429

Christoph, thanks a lot for your work.

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>             Fix For: 1.4
>
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt, patch3.txt


[jira] Updated: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christoph Kiehl updated JCR-974:
--------------------------------

    Attachment: patch.txt

This is a first patch which uses a FieldCache per index segment. To make this work we had to use our own implementation of FieldCache.StringIndex, which does not keep an array of sort indexes for the documents but instead keeps an array of the terms associated with each document. This of course uses more memory, and some performance/scaling tests still need to be done.
We had to modify SearchIndex.CombinedIndexReader and CachingMultiReader to allow access to the underlying IndexReaders, because those IndexReaders are used as cache keys in SharedFieldCache.
I'm not entirely satisfied with this solution, because SharedFieldSortComparator has to know that there is a CombinedIndexReader and currently even assumes it.
Performance-wise we achieved a speedup by a factor of 5-15 in our current application, where we have a lot of write operations and more than 1,000,000 nodes. For read-only repositories this patch slightly degrades performance by a factor of about 2.

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: patch.txt


[jira] Updated: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger updated JCR-974:
---------------------------------

    Attachment: ItemStateManagerBasedSortComparator.patch

Here's my attempt to keep FieldCaches per index reader.

Not well documented, but it's rather a prototype anyway.

WDYT?

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt


[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506497 ] 

Marcel Reutegger commented on JCR-974:
--------------------------------------

Do you have test cases or a description of the queries that you execute?

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: patch.txt


[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506599 ] 

Christoph Kiehl commented on JCR-974:
-------------------------------------

I tried building a test case, but you need a fairly large index to really see the benefits of my patch. In our production environment the workspace index is 500MB in size and the jcr:system index is about 1200MB (and both are of course still growing). With indexes as big as that, the effect of the operating system's file system cache is not as big as in small test cases. In my small test case the performance with my patch was a bit worse for repeated queries on an unchanged repository.
I think we should provide a little tool that takes the Wikipedia content and puts it all into a test repository, which could then be used for such test cases. What do you think?

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt


[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506499 ] 

Marcel Reutegger commented on JCR-974:
--------------------------------------

btw, both SearchIndex.CombinedIndexReader and CachingMultiReader implement MultiIndexReader which exposes getIndexReaders().

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt


[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506501 ] 

Christoph Kiehl commented on JCR-974:
-------------------------------------

The query I'm doing my tests with looks like this:

//element(*, app-mix:document) order by @app:modificationDate

Unfortunately I don't have a test case yet. I'll try to create one.

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt


[jira] Updated: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christoph Kiehl updated JCR-974:
--------------------------------

    Attachment: patch2.txt

This is a revised version of the first patch. The following changes were applied:

- Use the MultiIndexReader interface instead of providing our own methods on CombinedIndexReader and CachingMultiReader. This is not only better design but also improves performance a bit.
- Create caches proactively instead of lazily and use an array to access them. This improves performance a little for successive queries.
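
The second change can be sketched as follows. This is not the code from patch2: SegmentArrayLookupSketch, its fields, and the use of a starts array with a binary search are assumptions made for illustration, and every segment is assumed to contain at least one document. All per-segment caches are built up front and held in a plain array, so a lookup needs neither a map access nor synchronization on lazy initialization.

```java
import java.util.Arrays;

/** Hypothetical eager, array-based holder of per-segment caches. */
final class SegmentArrayLookupSketch {
    private final int[] starts;      // starts[i] = first global document id of segment i
    private final String[][] caches; // caches[i][localDoc] = cached term, filled up front

    SegmentArrayLookupSketch(String[][] segmentTerms) {
        this.caches = segmentTerms;
        this.starts = new int[segmentTerms.length];
        int next = 0;
        for (int i = 0; i < segmentTerms.length; i++) {
            starts[i] = next;
            next += segmentTerms[i].length;
        }
    }

    /** Index of the segment that contains the given global document id. */
    int segmentOf(int globalDoc) {
        int pos = Arrays.binarySearch(starts, globalDoc);
        // Not found: binarySearch returns -(insertionPoint) - 1; the segment is insertionPoint - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    /** Cached term of a document, addressed by its global document id. */
    String termOf(int globalDoc) {
        int segment = segmentOf(globalDoc);
        return caches[segment][globalDoc - starts[segment]];
    }

    public static void main(String[] args) {
        SegmentArrayLookupSketch lookup = new SegmentArrayLookupSketch(new String[][] {
            {"a", "c", "e"}, // global ids 0..2
            {"b"},           // global id 3
            {"d", "f"}       // global ids 4..5
        });
        for (int doc = 0; doc < 6; doc++) {
            System.out.println(doc + " -> segment " + lookup.segmentOf(doc) + ", term " + lookup.termOf(doc));
        }
    }
}
```

Range checks on the document id are omitted here; a real implementation would need them.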

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt


[jira] Updated: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christoph Kiehl updated JCR-974:
--------------------------------

    Attachment: patch3.txt

Patch3 incorporates changes as suggested. ReadOnlyIndexReaders are no longer cached in AbstractIndex. 

Thanks for taking the time to review the patches.

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt, patch2.txt, patch3.txt
>
>



[jira] Commented: (JCR-974) Manage Lucene FieldCaches per index segment

Posted by "Christoph Kiehl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506506 ] 

Christoph Kiehl commented on JCR-974:
-------------------------------------

Regarding your ItemStateManagerBasedSortComparator.patch: This patch doesn't work well in our scenario because we've got fairly large result sets. I think your patch might handle small result sets better than my patch, but for large result sets there are too many documents from different index segments. Using your patch, my query takes about 100000 ms, while using my patch it needs between 200 ms and 1000 ms.

Another feature of my patch is that it creates the caches lazily, per index segment. We also experimented with a global term cache, so that if the same term is returned by different index segments, the same String object is used in each FieldCache. This reduces the total FieldCache size when a term is contained in multiple index segments. In our case the default FieldCache was about 4 MB for a certain field, while the patched FieldCache was about 2.5 MB.
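
To illustrate the two ideas above (lazy per-segment caches and a shared term pool), here is a minimal sketch in plain Java. It is not the code from the attached patches and does not use the Lucene API: the class names TermPool and PerSegmentFieldCache, the segmentKey parameter and the loader callback are all hypothetical stand-ins for "an index segment" and "read the field's terms from that segment". It also uses newer Java features than Jackrabbit 1.3 targets.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical global term cache: maps equal term strings to one shared String instance. */
final class TermPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance for the given term, adding it if it is not pooled yet. */
    String canonical(String term) {
        String existing = pool.putIfAbsent(term, term);
        return existing != null ? existing : term;
    }

    /** Number of distinct terms currently pooled. */
    int size() {
        return pool.size();
    }
}

/** Hypothetical per-segment cache: one term array per segment, built on first access. */
final class PerSegmentFieldCache {
    private final TermPool pool;
    private final Map<Object, String[]> caches = new ConcurrentHashMap<>();

    PerSegmentFieldCache(TermPool pool) {
        this.pool = pool;
    }

    /**
     * Returns the cached terms for a segment. The loader runs only if this
     * segment has no cache yet; other segments keep their existing arrays.
     */
    String[] get(Object segmentKey, Supplier<String[]> loader) {
        return caches.computeIfAbsent(segmentKey, key -> {
            String[] raw = loader.get();
            String[] shared = new String[raw.length];
            for (int i = 0; i < raw.length; i++) {
                // Documents without a value for the field stay null.
                shared[i] = raw[i] == null ? null : pool.canonical(raw[i]);
            }
            return shared;
        });
    }

    /** Drops the cache of a single segment, e.g. after that segment was merged away. */
    void evict(Object segmentKey) {
        caches.remove(segmentKey);
    }
}
```

With this layout, modifying one segment only requires evicting and rebuilding that segment's array, and a term that occurs in several segments is stored as a single String object. The pool in this sketch never shrinks; a real implementation would need some eviction strategy for terms that are no longer referenced.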

> Manage Lucene FieldCaches per index segment
> -------------------------------------------
>
>                 Key: JCR-974
>                 URL: https://issues.apache.org/jira/browse/JCR-974
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3
>            Reporter: Christoph Kiehl
>         Attachments: ItemStateManagerBasedSortComparator.patch, patch.txt
>
>
