Posted to dev@lucene.apache.org by Robert Engels <re...@ix.netcom.com> on 2006/06/27 18:26:52 UTC

RE: caching term information?

Doug,

If possible, I would like your comments on the following:

It seems the following factors influence the performance of Lucene.

1. determining what information needs to be read from the index
2. reading compressed information from disk
3. decompressing the information
4. searching the information
5. garbage collection overhead

To address each performance factor:

1. The current index scheme seems efficient. There may be algorithms better
suited to actual term-lookup patterns than the current binary search over the
term index, but the current solution is a good general-purpose one.

2. The OS disk cache does what it can here. Better compression would lead to
less disk IO, but you might pay a CPU cost in #3. Increasing the memory
available to the disk cache minimizes this factor. Moving some or all of the
compressed information into a local cache would avoid OS calls and context
switches, but the GC overhead might actually reduce performance.
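
One way to experiment with that local-cache variant is to load the whole
index into a RAMDirectory and measure whether the saved system calls
outweigh the added GC pressure. A minimal sketch (the index path is a
placeholder):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    // Copy the on-disk index into the Java heap: reads no longer hit
    // the OS, but every cached byte now competes with the application
    // for garbage collection.
    Directory disk = FSDirectory.getDirectory("/path/to/index", false);
    Directory ram = new RAMDirectory(disk);
    IndexSearcher searcher = new IndexSearcher(ram);
    // ... run queries against 'searcher', then searcher.close() ...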

3. A more efficient algorithm would reduce this, but the current Lucene
algorithm is fairly simple and high performing. A potential solution would be
to cache decompressed pages of information. For example, I believe
decompressing and searching VInts is orders of magnitude slower than
searching an array of ints. There is a memory usage vs. CPU cost trade-off
here.
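
A rough sketch of what building such a decompressed page might look like
(the paging and eviction policy are omitted, and decodeVIntPage is just an
illustrative name):

    import org.apache.lucene.store.IndexInput;
    import java.io.IOException;

    // Sketch: decode a run of VInts once into an int[] "page" so that
    // later scans read plain ints instead of re-decoding bytes.
    static int[] decodeVIntPage(IndexInput in, int count) throws IOException {
      int[] page = new int[count];
      for (int i = 0; i < count; i++)
        page[i] = in.readVInt();   // pay the decompression cost exactly once
      return page;
    }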

4. Once the information is in a decompressed state, query evaluation time is
the limiting factor. The current boolean scorers with skipTo() seem to
optimize this about as well as possible (ordering clauses by frequency). An
optimization that may work in some environments is to stop evaluating the
query at some point and then just filter the results; this is similar to the
current 'convert term to filter' optimization. The problem is that for a
highly interactive index the filter recomputation cost becomes quite high
(and caching a multitude of filters can cost memory). An alternative solution
would be to store all fields, and have the filter read the stored documents
to filter them, as in the sketch below. Ideally, query evaluation could stop
once the number of potentially matching documents fell below a certain
threshold, and the remaining clauses could then be converted into these kinds
of filters (rather than into a potentially uncached range filter).
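
A sketch of that stored-field filtering idea (filterByStoredField, 'field'
and 'value' are placeholder names; it assumes the dropped clause can be
tested against a single stored field):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import java.io.IOException;

    // Sketch: test a remaining clause against the stored fields of the
    // few surviving candidates instead of recomputing a filter over the
    // whole index.
    static int[] filterByStoredField(IndexReader reader, int[] candidates,
                                     String field, String value)
        throws IOException {
      int[] kept = new int[candidates.length];
      int n = 0;
      for (int i = 0; i < candidates.length; i++) {
        Document doc = reader.document(candidates[i]); // read stored doc
        if (value.equals(doc.get(field)))
          kept[n++] = candidates[i];
      }
      int[] result = new int[n];
      System.arraycopy(kept, 0, result, 0, n);
      return result;
    }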

5. I don't think this can be overlooked. Decompression, especially of
Strings, creates a lot of garbage. The other heavily used methods should be
evaluated for object reuse or primitives. One optimization here might be some
sort of reusable array held in a ThreadLocal for building Strings; I haven't
thought about this extensively. A Lucene server under heavy load eventually
starts competing with the GC.
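
Something along these lines, perhaps (a sketch only; sizing policy and
buffer ownership would need real thought):

    // Sketch: a per-thread, growable scratch buffer so hot paths can
    // build Strings without allocating a fresh char[] on every call.
    private static final ThreadLocal scratch = new ThreadLocal() {
      protected Object initialValue() {
        return new char[256];
      }
    };

    static char[] getScratch(int minLength) {
      char[] buf = (char[]) scratch.get();
      if (buf.length < minLength) {
        buf = new char[minLength];
        scratch.set(buf);          // keep the larger buffer for next time
      }
      return buf;
    }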

In summary: in my profiling, Lucene is heavily CPU bound when performing
small searches, and the time appears to go to decompression. Lucene is
heavily IO bound when reading documents, but this is to be expected, and an
OS cache large enough to hold the commonly used documents helps
significantly.



-----Original Message-----
From: Doug Cutting [mailto:cutting@apache.org] 
Sent: Monday, May 22, 2006 5:48 PM
To: java-dev@lucene.apache.org
Subject: Re: caching term information?

Robert Engels wrote:
> I was amazed at how much time is spent in both readVint and readByte().
> Seems high, but I think it is mainly due to the number of invocations.

Profilers have been known to exaggerate this sort of thing.  These are
central routines of Lucene, but they're also pretty simple and hard to make
a lot faster.

> 1) What if BufferedIndexInput had an optimized version of readVint 
> that used the buffer and manipulated the position directly?

Give it a try and see if it's much faster.  Sun's JVMs are pretty smart
these days, and such micro-optimizations are proving less likely to improve
things than they used to be.  Also, we don't want to tune things too highly
for any given JVM, so it would have to be substantially faster to warrant
committing something like this.
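
For reference, the kind of thing to try might look roughly like this (a
sketch only; it assumes the subclass can actually see BufferedIndexInput's
buffer, bufferPosition and bufferLength fields, which may not hold):

    // Decode directly from the buffered byte array when a whole VInt is
    // guaranteed to be available; otherwise fall back to the generic
    // path, which refills the buffer as needed.
    public int readVInt() throws IOException {
      if (bufferPosition + 5 > bufferLength)  // a VInt is at most 5 bytes
        return super.readVInt();
      byte b = buffer[bufferPosition++];
      int i = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = buffer[bufferPosition++];
        i |= (b & 0x7F) << shift;
      }
      return i;
    }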

> 2) Instead of caching the TermInfo, what if the TermDocs were cached 
> (again for the top 20% terms). The memory requirement would be much 
> greater, but you could also say "do not cache the TermDocs that had 
> more than X documents". The optimized searcher already converts 
> TermQueries similar to this to a Filter anyway.

The majority of query time is typically spent processing terms that occur in
lots of documents. Terms that occur in only a few documents are faster to
process, so speeding them up doesn't affect overall performance as much as
one might hope.
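
For concreteness, the capped cache being proposed might look something like
this sketch (Term -> int[], with eviction and invalidation on index change
ignored):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch: materialize a term's postings into an int[] and cache
    // them, skipping any term whose docFreq exceeds the cap.
    class TermDocsCache {
      private final Map cache = new HashMap();   // Term -> int[]
      private final int maxDocs;

      TermDocsCache(int maxDocs) { this.maxDocs = maxDocs; }

      int[] docs(IndexReader reader, Term term) throws IOException {
        int[] cached = (int[]) cache.get(term);
        if (cached != null)
          return cached;
        if (reader.docFreq(term) > maxDocs)
          return null;                           // too common: don't cache
        int[] docs = new int[reader.docFreq(term)];
        TermDocs td = reader.termDocs(term);
        int n = 0;
        while (td.next())
          docs[n++] = td.doc();
        td.close();
        if (n < docs.length) {                   // deletions: trim to fit
          int[] t = new int[n];
          System.arraycopy(docs, 0, t, 0, n);
          docs = t;
        }
        cache.put(term, docs);
        return docs;
      }
    }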

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

