Posted to dev@lucene.apache.org by "Michael Busch (JIRA)" <ji...@apache.org> on 2008/05/21 10:37:55 UTC

[jira] Updated: (LUCENE-1195) Performance improvement for TermInfosReader

     [ https://issues.apache.org/jira/browse/LUCENE-1195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1195:
----------------------------------

    Attachment: lucene-1195.patch

Changes in the patch:
- the cache is now thread-safe
- added a ThreadLocal to TermInfosReader, so that each thread now has its own cache of size 1024
- SegmentTermEnum.scanTo() now returns the number of invocations of next(). TermInfosReader only
  puts TermInfo objects into the cache if scanTo() has called next() more than once. Thus, if e.g.
  a WildcardQuery or RangeQuery iterates over terms in order, only the first term will be put into
  the cache. In addition to the ThreadLocal, this prevents a thread from wiping out its own cache
  with such a query.
- added a new package org/apache/lucene/util/cache that contains a SimpleMapCache (taken from
  LUCENE-831) and the SimpleLRUCache that was part of the previous patch here. I decided to put
  the caches in a separate package because we can reuse them for different things, such as
  LUCENE-831, or e.g. as an LRU cache for recently loaded stored documents after deprecating Hits.
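The per-thread caching described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the patch's actual code: the class and method names (PerThreadTermCache, maybePut) are invented for the sketch, and the real SimpleLRUCache in the patch may be implemented differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: a per-thread LRU cache plus the scanTo() heuristic.
public class PerThreadTermCache {
    static final int CACHE_SIZE = 1024;

    // LinkedHashMap in access order evicts the least recently used entry
    // once the capacity is exceeded.
    static class SimpleLRUCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;
        SimpleLRUCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true
            this.capacity = capacity;
        }
        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    // Each thread gets its own cache, so the hot lookup path needs
    // no synchronization.
    private final ThreadLocal<SimpleLRUCache<String, Object>> cache =
        ThreadLocal.withInitial(() -> new SimpleLRUCache<>(CACHE_SIZE));

    public Object get(String term) {
        return cache.get().get(term);
    }

    // Only cache the result if the scan called next() more than once;
    // a query that iterates terms in order (e.g. a WildcardQuery) then
    // caches at most its first term instead of flushing the whole cache.
    public void maybePut(String term, Object termInfo, int nextCalls) {
        if (nextCalls > 1) {
            cache.get().put(term, termInfo);
        }
    }
}
```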
  
I reran the same performance experiments, and the speedup is still the same; the overhead of the
ThreadLocal is in the noise. So I think this should be a good approach now?

I also ran similar performance tests on a bigger index with about 4.3 million documents. As
expected, the speedup with 50k AND queries was not as significant, but it was still about 7%.
I haven't run the OR queries on the bigger index yet, but most likely their speedup will not
be very significant either.

All unit tests pass.

> Performance improvement for TermInfosReader
> -------------------------------------------
>
>                 Key: LUCENE-1195
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1195
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: lucene-1195.patch, lucene-1195.patch
>
>
> Currently we have a bottleneck for multi-term queries: the dictionary lookup is done
> twice for each term. The first time is in Similarity.idf(), where searcher.docFreq() is called;
> the second is when the posting list is opened (TermDocs or TermPositions).
> The dictionary lookup is not cheap, so a significant performance improvement is possible
> here if we avoid the second lookup. An easy way to do this is to add a small LRU
> cache to TermInfosReader.
> I ran some performance experiments with an LRU cache size of 20 and a mid-size index of
> 500,000 documents from wikipedia. Here are some test results:
> 50,000 AND queries with 3 terms each:
> old:                  152 secs
> new (with LRU cache): 112 secs (26% faster)
> 50,000 OR queries with 3 terms each:
> old:                  175 secs
> new (with LRU cache): 133 secs (24% faster)
> For bigger indexes this patch will probably have less impact, for smaller ones more.
> I will attach a patch soon.
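The double-lookup bottleneck quoted above can be illustrated with a small sketch. Note that the names here (CachedTermInfoLookup, lookupInDictionary) are invented for illustration and do not correspond to Lucene's actual API; the point is only that the second request for the same term is served from the cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of avoiding the second dictionary lookup per term.
public class CachedTermInfoLookup {
    private final Map<String, Long> cache = new HashMap<>(); // stands in for the LRU cache
    private int dictionaryLookups = 0; // counts the expensive lookups

    // Simulates the expensive dictionary lookup.
    private long lookupInDictionary(String term) {
        dictionaryLookups++;
        return term.hashCode(); // placeholder for a real TermInfo
    }

    public long get(String term) {
        Long cached = cache.get(term);
        if (cached != null) {
            return cached; // a repeated lookup for the same term hits the cache
        }
        long info = lookupInDictionary(term);
        cache.put(term, info);
        return info;
    }

    public int lookups() {
        return dictionaryLookups;
    }
}
```

With this sketch, calling get("foo") once for docFreq() and once more when opening the posting list performs only a single dictionary lookup.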

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org