Posted to java-user@lucene.apache.org by thomasg <th...@hotmail.com> on 2006/03/28 12:06:54 UTC

Lucene Performance Issues

Hi, we are currently intending to implement a document storage / search tool
using Jackrabbit and Lucene. We have been approached by a commercial search
and indexing organisation called ISYS who are suggesting the following
problems with using Lucene. We do have a requirement to store and search
large documents and the total document store will be large too. Any comments
on the following would be greatly appreciated.

1) By default, Lucene only indexes the first 10,000 words from each
document. When increasing this default out-of-memory errors can occur. This
implies that documents, or large sections thereof, are loaded into memory.
ISYS has a very small memory footprint which is not affected by document
size nor number of documents.

 
2) Lucene appears to be slow at indexing, at least by ISYS' standards.
Published performance benchmarks seem to vary between almost acceptable,
down to very poor. ISYS' file readers are already optimized for the fastest
text extraction possible.

 
3) The Lucene documentation suggests it can be slow at searching and can get
slower and slower the larger your indexes get. The tipping point is where
the index size exceeds the amount of free memory in your machine. This also
implies that whole indexes, or large portions of them, are loaded into
memory. The bigger the index, the more powerful the machine required. ISYS'
search speed is always proportional to the size of the result set. Index
size does not materially affect search speed and the index is never loaded
into memory. It also appears that Lucene requires hands-on tuning to keep
its search speed acceptable. ISYS' indexes are self-managing and do not
require any maintenance to keep them searchable at full speed.


Thanks, Thomas
--
View this message in context: http://www.nabble.com/Lucene-Performance-Issues-t1354811.html#a3626992
Sent from the Lucene - Java Users forum at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Lucene Performance Issues

Posted by thomasg <th...@hotmail.com>.
Thanks very much for your thoughts; that gives me a lot to think about. I'm
currently running benchmark tests on typical usage scenarios with Lucene. I'm
actually using Lucene through its integration with the Jackrabbit DMS, so it
may not be easy, or even possible, to use a different search engine anyway.
Of course I'd rather be developing with open source tools, but the project
manager thought it good to ask around. Thanks again; I will get some
benchmarks that will mean more than guesswork.
--
View this message in context: http://www.nabble.com/Lucene-Performance-Issues-t1354811.html#a3633569
Sent from the Lucene - Java Users forum at Nabble.com.




Re: Lucene Performance Issues

Posted by Doug Cutting <cu...@apache.org>.
thomasg wrote:
> Hi, we are currently intending to implement a document storage / search tool
> using Jackrabbit and Lucene. We have been approached by a commercial search
> and indexing organisation called ISYS who are suggesting the following
> problems with using Lucene. We do have a requirement to store and search
> large documents and the total document store will be large too. Any comments
> on the following would be greatly appreciated.

I would tend to be sceptical of claims from someone who is trying to 
sell you something: they are inherently biased.  The best way to answer 
performance questions is to benchmark.  That way you'll also get 
experience with each alternative.

Ease of development, availability of support, ease of debugging, cost, 
and other aspects besides raw performance may also prove more important 
long-term.

> 1) By default, Lucene only indexes the first 10,000 words from each
> document. When increasing this default out-of-memory errors can occur. This
> implies that documents, or large sections thereof, are loaded into memory.
> ISYS has a very small memory footprint which is not affected by document
> size nor number of documents.

Lucene does store a few bytes per word indexed while indexing.  But I've 
never seen folks complain about this as a significant consumer of 
memory, even when they crank up the limit.  How large are your large 
documents?  (Is it really useful to get hits on such large documents 
anyway, or might it be more useful to get hits on sections of these?)
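One way to act on the suggestion of indexing sections rather than whole documents is sketched below in plain Java. This is not Lucene code: the class name SectionSplitter, the method split, and the fixed word-count window are assumptions made purely for illustration. Each returned section would then be added to the index as its own document, together with some identifier of the parent document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Lucene): cuts a long text into
// sections of at most maxWords whitespace-separated words each.
final class SectionSplitter {

    static List<String> split(String text, int maxWords) {
        if (maxWords <= 0) {
            throw new IllegalArgumentException("maxWords must be positive");
        }
        List<String> sections = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return sections; // nothing to index
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start += maxWords) {
            int end = Math.min(start + maxWords, words.length);
            // Re-join the words of this window with single spaces.
            sections.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
        }
        return sections;
    }
}
```

With maxWords kept below the 10,000-word default mentioned in the question, sections should rarely hit that limit, although the exact count depends on how the analyzer in use tokenizes the text.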

> 2) Lucene appears to be slow at indexing, at least by ISYS' standards.
> Published performance benchmarks seem to vary between almost acceptable,
> down to very poor. ISYS' file readers are already optimized for the fastest
> text extraction possible.

How fast do you need to index?  How fast does your collection change? 
In most systems I've worked on, Lucene's indexing speed has not been a 
bottleneck.  For example, when crawling with Nutch, downloading pages 
and html parsing are typically much slower than indexing.  Lucene does 
not include text extractors, so it's hard to know what ISYS is comparing 
to there.  Also, when benchmarking indexing, one should look at the rate 
when the index is very large, not just when the index is small.
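The advice to measure the indexing rate as the index grows can be sketched as a small timing harness. This is an illustration in plain Java, not a Lucene API: IndexingBenchmark, docsPerSecondByBatch, and the indexOne callback are hypothetical names, and indexOne stands in for whatever call adds document number i to the engine being tested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical timing harness: indexOne is a placeholder for the call
// that adds document number i to the index under test.
final class IndexingBenchmark {

    // Returns documents per second for each consecutive batch, so that a
    // slowdown as the index grows shows up as falling numbers.
    static List<Double> docsPerSecondByBatch(int totalDocs, int batchSize,
                                             IntConsumer indexOne) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Double> rates = new ArrayList<>();
        for (int start = 0; start < totalDocs; start += batchSize) {
            int end = Math.min(start + batchSize, totalDocs);
            long t0 = System.nanoTime();
            for (int i = start; i < end; i++) {
                indexOne.accept(i);
            }
            // Guard against a zero reading on coarse clocks.
            long elapsedNanos = Math.max(1L, System.nanoTime() - t0);
            rates.add((end - start) * 1e9 / elapsedNanos);
        }
        return rates;
    }
}
```

Comparing the first and last entries of the returned list gives a rough picture of how the rate changes between a small index and a large one.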

> 3) The Lucene documentation suggests it can be slow at searching and can get
> slower and slower the larger your indexes get. The tipping point is where
> the index size exceeds the amount of free memory in your machine. This also
> implies that whole indexes, or large portions of them, are loaded into
> memory.

Lucene does not normally store entire indexes in memory.  Indexes much 
larger than memory can be searched quickly.  For very high traffic (tens 
to hundreds of searches per second) indexes smaller than memory are 
required, since otherwise the disk becomes a bottleneck.  My rule of 
thumb is that an index of 10M documents of roughly 10k of text each (100G 
of text in total) easily gives sub-second average response time for 
typical queries.

> The bigger the index, the more powerful the machine required. ISYS'
> search speed is always proportional to the size of the result set.

That sounds suspicious to me, unless by "size of result set" they mean 
the total number of documents that include any query term, in which case 
that is also true of Lucene.  But if you're using an "AND" query, the 
size of the result set is usually much smaller than that total.

Do they claim they can search billions of documents with a boolean query 
that only matches a few documents in time proportional to the size of 
those few matching documents?  And all that on a 486 with 1MB of RAM? 
That sounds like magic!

> Index
> size does not materially affect search speed and the index is never loaded
> into memory.

Index size does not affect Lucene's search speed per se: what matters is 
the frequency of the search terms.  And terms tend to have larger 
frequencies in larger indexes.
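Why term frequency rather than result-set size drives the cost of an "AND" query can be seen in a textbook intersection of two sorted posting lists. The sketch below is a simplified model in plain Java and is not Lucene's implementation; real engines, Lucene included, use more refined techniques. The names PostingIntersection, Result, and intersect are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a textbook merge of two sorted posting lists
// (lists of ids of the documents that contain a term).
final class PostingIntersection {

    // Matching document ids plus the number of comparison steps taken.
    static final class Result {
        final List<Integer> docIds;
        final int steps;

        Result(List<Integer> docIds, int steps) {
            this.docIds = docIds;
            this.steps = steps;
        }
    }

    // Both arrays must be sorted in ascending order.
    static Result intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int steps = 0;
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            steps++;
            if (a[i] == b[j]) {
                hits.add(a[i]); // document contains both terms
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return new Result(hits, steps);
    }
}
```

In this model the number of steps is bounded by the combined length of the two posting lists, that is, by how often the terms occur, while the number of hits can be far smaller.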

> It also appears that Lucene requires hands-on tuning to keep
> its search speed acceptable.

I don't recommend tuning Lucene, but rather just using the out-of-the-box 
defaults.  Folks tend to cause problems for themselves trying to 
make things 5% faster by setting parameters to extreme values without 
understanding the consequences of those parameters.  This results in 
lots of tuning-related messages on the mailing lists, but I am not 
convinced many of these are required.

> ISYS' indexes are self-managing and do not
> require any maintenance to keep them searchable at full speed.

Lucene could make the same claim.

I suspect you can achieve adequate performance with either product.  The 
more salient difference will be that with Lucene you'd have the full 
source code and mailing lists like this to help you.  With ISYS you'd 
have a black box, a detailed manual, and a phone number.  So it depends 
on your style of development: do you want to be able to peek inside the 
search-engine, or do you want to be able to pick up the phone?

Cheers,

Doug



Re: Lucene Performance Issues

Posted by Eric Jain <Er...@isb-sib.ch>.
thomasg wrote:
> 1) By default, Lucene only indexes the first 10,000 words from each
> document. When increasing this default out-of-memory errors can occur. This
> implies that documents, or large sections thereof, are loaded into memory.
> ISYS has a very small memory footprint which is not affected by document
> size nor number of documents.

As far as I know, documents do indeed have to be built in memory prior to 
indexing. But this shouldn't be a problem unless you have only a few 
megabytes of memory, or you have documents that are hundreds of megabytes 
large -- and such large documents should probably be split, anyway.


> 2) Lucene appears to be slow at indexing, at least by ISYS' standards.
> Published performance benchmarks seem to vary between almost acceptable,
> down to very poor. ISYS' file readers are already optimized for the fastest
> text extraction possible.

Indexing performance is my main concern with Lucene, though there are 
several parameters that can be tuned and I haven't exhausted all of them yet...

Currently I am using:

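   // Merge segments less often: faster bulk indexing, more open files.
   writer.setMergeFactor(100);
   // Buffer up to 100 documents in memory before writing a segment.
   writer.setMaxBufferedDocs(100);
   // Keep each segment's files separate instead of packing a compound file.
   writer.setUseCompoundFile(false);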

This allows me to build a 3GB index with about 3M documents in 6h on a 
2x2GHz Intel Xeon machine with 1GB of memory and a reasonably fast hard 
disk. There is some other stuff going on besides the indexing, but the 
indexing does seem to take up the greatest amount of time.
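As a baseline for comparing other benchmarks, the figures quoted above (about 3M documents in 6h) work out to roughly 139 documents per second. The arithmetic is sketched below; IndexingRate and docsPerSecond are hypothetical names used only for this illustration.

```java
// Back-of-the-envelope arithmetic on the figures quoted in the message.
final class IndexingRate {

    // Average throughput for a run that indexed `documents` documents
    // in `hours` hours of wall-clock time.
    static double docsPerSecond(long documents, double hours) {
        if (hours <= 0) {
            throw new IllegalArgumentException("hours must be positive");
        }
        return documents / (hours * 3600.0);
    }
}
```

Calling IndexingRate.docsPerSecond(3_000_000L, 6.0) reproduces the figure; note that it covers the whole run, including the other work mentioned above that is not indexing.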

Note that Lucene also supports incremental updates.


> 3) The Lucene documentation suggests it can be slow at searching and can get
> slower and slower the larger your indexes get. The tipping point is where
> the index size exceeds the amount of free memory in your machine. This also
> implies that whole indexes, or large portions of them, are loaded into
> memory. The bigger the index, the more powerful the machine required. ISYS'
> search speed is always proportional to the size of the result set. Index
> size does not materially affect search speed and the index is never loaded
> into memory. It also appears that Lucene requires hands-on tuning to keep
> its search speed acceptable. ISYS' indexes are self-managing and do not
> require any maintenance to keep them searchable at full speed.

Queries on the index mentioned above return results within a few 
milliseconds, with less than 256MB used by the VM, though some complex 
queries that contain a lot of frequent terms may take up to several 
seconds. I'm not sure how Lucene's searching performance can be tuned, but 
haven't bothered to do so as it hasn't been a bottleneck, so far...



Re: Lucene Performance Issues

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi Thomas,

Sounds like FUD to me.  No concrete numbers, and the benchmark they mention... eh, haven't we all seen "funny" benchmarks before?  Lucene is used in many large operations (e.g. Technorati, Simpy) that involve a LOT of indexing and searching, large indices, etc.  I suggest you try both and see which one suits your needs.

Otis
