Posted to dev@jackrabbit.apache.org by "Marcel Reutegger (JIRA)" <ji...@apache.org> on 2007/03/16 12:14:09 UTC

[jira] Commented: (JCR-791) Cache BitSet per IndexReader in MatchAllScorer

    [ https://issues.apache.org/jira/browse/JCR-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12481569 ] 

Marcel Reutegger commented on JCR-791:
--------------------------------------

Here's what I've done so far:

- Introduced a MultiIndexReader interface that allows access to the sub index readers.
- CachingMultiReader and SearchIndex.CombinedIndexReader now implement MultiIndexReader
- Created a MultiScorer that spans multiple sub scorers and combines them. The MultiScorer exposes the sub scorers as if there were just a single scorer.
- Changed MatchAllWeight to create individual scorers for each sub IndexReader contained in a MultiIndexReader and to combine them into a MultiScorer.
- Introduced a BitSet cache in MatchAllScorer
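
The BitSet cache in the last item could look roughly like the following sketch. This is hypothetical illustration code, not the actual Jackrabbit implementation: a plain Object stands in for the Lucene IndexReader, and a WeakHashMap is used so a cached BitSet can be collected once its segment reader is no longer referenced.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.Function;

// Hypothetical per-reader BitSet cache. The key is the (sub) index reader;
// weak keys let entries disappear when a segment reader is closed and
// garbage collected, so stale segments do not pile up in the cache.
public class BitSetCache {
    private final Map<Object, BitSet> cache = new WeakHashMap<Object, BitSet>();

    // Returns the cached matches for this reader, computing them on a miss.
    public BitSet getMatches(Object reader, Function<Object, BitSet> compute) {
        BitSet bits = cache.get(reader);
        if (bits == null) {
            bits = compute.apply(reader);
            cache.put(reader, bits);
        }
        return bits;
    }
}
```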

I then conducted the following tests:

Setup:

- 50'000 nodes
- resultFetchSize: 50
- respectDocumentOrder: false

100 queries: //element(*, nt:unstructured)[@foo]
 (only size of NodeIterator is read, no node access)

Results:

1) with jackrabbit 1.2.3:
    82078 ms

2) with MatchAllScorer per index segment
  combined with MultiScorer without caching:
    10297 ms

3) with MatchAllScorer per index segment
  combined with MultiScorer with caching:
     6156 ms

My conclusion is that the Lucene MultiTermDocs implementation adds significant cost when a single MatchAllScorer is used, as in test scenario 1). And it actually makes sense: if a single MatchAllScorer is used, Lucene has to merge sort the @foo terms of several index segments, while in test scenarios 2) and 3) no merge sort is needed for the @foo terms.
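
To illustrate the point (a hypothetical sketch, not Jackrabbit's actual MultiScorer): when each segment produces its own match BitSet, the combined result can be built by offsetting each segment's doc ids by that segment's doc base. Within a segment the ids are already ordered, and segments are simply concatenated, so no merge sort across segments is needed.

```java
import java.util.BitSet;

// Illustrative sketch of the per-segment approach: flatten per-segment
// match BitSets into global doc ids by adding each segment's doc base.
public class MultiScorerSketch {
    public static int[] globalDocs(BitSet[] segments, int[] docBases) {
        int total = 0;
        for (BitSet s : segments) total += s.cardinality();
        int[] docs = new int[total];
        int i = 0;
        for (int seg = 0; seg < segments.length; seg++) {
            // Doc ids within a segment are already sorted; concatenating the
            // segments in order keeps the global result sorted without any merge.
            for (int d = segments[seg].nextSetBit(0); d >= 0; d = segments[seg].nextSetBit(d + 1)) {
                docs[i++] = docBases[seg] + d;
            }
        }
        return docs;
    }
}
```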

With these changes the query performance seems good enough even without caching. 

I'm tempted to check in only the changes without caching, because the additional performance improvement with caching does not seem to warrant the memory consumption of the cache: 2) decreases the query time compared to the current implementation by 87%, while 3) decreases it by 92%.
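
For reference, the percentages follow directly from the timings above, as improvement = (baseline - variant) / baseline:

```java
// 82078 ms is the baseline (jackrabbit 1.2.3); 10297 ms and 6156 ms are the
// per-segment variants without and with caching, from the results above.
public class Improvement {
    public static long percentImprovement(long baselineMs, long variantMs) {
        return Math.round(100.0 * (baselineMs - variantMs) / baselineMs);
    }
}
```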

> Cache BitSet per IndexReader in MatchAllScorer
> ----------------------------------------------
>
>                 Key: JCR-791
>                 URL: https://issues.apache.org/jira/browse/JCR-791
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>            Reporter: Marcel Reutegger
>            Priority: Minor
>
> The BitSets created in MatchAllScorer should be cached per IndexReader. This enhancement should also take care that the supplied IndexReader may in fact be a CombinedIndexReader or a CachingMultiReader with multiple contained IndexReaders. To achieve a good cache efficiency the BitSets must be cached per contained IndexReader and combined later.
> See also thread on dev list: http://thread.gmane.org/gmane.comp.apache.jackrabbit.devel/10976



Re: Commented: (JCR-791) Cache BitSet per IndexReader in MatchAllScorer

Posted by Marcel Reutegger <ma...@gmx.net>.
Hi Christoph,

Christoph Kiehl wrote:
> The effect of caching should increase if you use queries which test an 
> attribute more than once, like:
> 
> //element(*, nt:unstructured)[@foo!='1' or @foo!='2' or @foo!='3']
> 
> Maybe we can add a configuration option to SearchIndex which allows one to 
> enable caching? This way one can choose whether the focus is on memory 
> or on processing time. We have a situation, for example, where a lot of 
> memory is available but processing time is the bottleneck.

I agree, and it makes a lot of sense to cache the BitSet at least for the duration 
of the query execution. I personally try to avoid caches unless the performance 
improvement is significant. With every cache we introduce, we need to make sure 
that it does not use up all memory. The simple cache in MatchAllScorer I used 
for testing does not take care of that. If we introduce a cache there, we would 
have to add some more code to integrate it into the CacheManager.
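
A minimal sketch of such a bounded cache (hypothetical; a real version would integrate with Jackrabbit's CacheManager rather than hard-coding a limit). A LinkedHashMap in access order evicts the least recently used BitSet once a maximum entry count is exceeded, so the cache cannot grow without bound:

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-bounded LRU cache of BitSets, keyed by index reader.
public class BoundedBitSetCache extends LinkedHashMap<Object, BitSet> {
    private final int maxEntries;

    public BoundedBitSetCache(int maxEntries) {
        super(16, 0.75f, true); // access-order iteration gives LRU behaviour
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each put; returning true evicts the
    // least recently accessed entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<Object, BitSet> eldest) {
        return size() > maxEntries;
    }
}
```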

> Would you mind sharing a patch for the caching you implemented? Do you 
> maybe even have a testcase which generates this test repository? I 
> could do some further tests here ..

I'll attach them to the jira issue, once jira is up and running again...

regards
  marcel

Re: Commented: (JCR-791) Cache BitSet per IndexReader in MatchAllScorer

Posted by Christoph Kiehl <ch...@sulu3000.de>.
Marcel Reutegger (JIRA) wrote:

> Here's what I've done so far:
> 
> - Introduced a MultiIndexReader interface that allows access to the sub index readers.
> - CachingMultiReader and SearchIndex.CombinedIndexReader now implement MultiIndexReader
> - Created a MultiScorer that spans multiple sub scorers and combines them. The MultiScorer exposes the sub scorers as if there were just a single scorer.
> - Changed MatchAllWeight to create individual scorers for each sub IndexReader contained in a MultiIndexReader and to combine them into a MultiScorer.
> - Introduced a BitSet cache in MatchAllScorer

Great. Thanks a lot!

> I then conducted the following tests:
> 
> Setup:
> 
> - 50'000 nodes
> - resultFetchSize: 50
> - respectDocumentOrder: false
> 
> 100 queries: //element(*, nt:unstructured)[@foo]
>  (only size of NodeIterator is read, no node access)
> 
> Results:
> 
> 1) with jackrabbit 1.2.3:
>     82078 ms
> 
> 2) with MatchAllScorer per index segment
>   combined with MultiScorer without caching:
>     10297 ms
> 
> 3) with MatchAllScorer per index segment
>   combined with MultiScorer with caching:
>      6156 ms
> 
> My conclusion is that the Lucene MultiTermDocs implementation adds significant cost when a single MatchAllScorer is used, as in test scenario 1). And it actually makes sense: if a single MatchAllScorer is used, Lucene has to merge sort the @foo terms of several index segments, while in test scenarios 2) and 3) no merge sort is needed for the @foo terms.
> 
> With these changes the query performance seems good enough even without caching. 
> 
> I'm tempted to check in only the changes without caching, because the additional performance improvement with caching does not seem to warrant the memory consumption of the cache: 2) decreases the query time compared to the current implementation by 87%, while 3) decreases it by 92%.

The effect of caching should increase if you use queries which test an attribute 
more than once, like:

//element(*, nt:unstructured)[@foo!='1' or @foo!='2' or @foo!='3']

Maybe we can add a configuration option to SearchIndex which allows one to enable 
caching? This way one can choose whether the focus is on memory or on processing 
time. We have a situation, for example, where a lot of memory is available but 
processing time is the bottleneck.

Would you mind sharing a patch for the caching you implemented? Do you maybe 
even have a testcase which generates this test repository? I could do some 
further tests here ..

Cheers,
Christoph