Posted to java-user@lucene.apache.org by Vincent Le Maout <vi...@lingway.com> on 2004/07/26 17:34:29 UTC

over 300 GB to index: feasibility and performance issue

Hi everyone,

I have to index a huge, huge amount of data: about 10 million documents
making up about 300 GB. Is there any technical limitation in Lucene that
could prevent me from processing such an amount (I mean, of course, apart
from the external limits imposed by the hardware: RAM, disks, the system,
whatever)? If it is possible, does anyone have an idea of the resources
needed: RAM, CPU time, size of indexes, access time on such a collection?
If not, is it possible to extrapolate an estimate from previous
benchmarks?

Thanks in advance.
Regards.

Vincent Le Maout

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: over 300 GB to index: feasibility and performance issue

Posted by Doug Cutting <cu...@apache.org>.
Vincent Le Maout wrote:
> I have to index a huge, huge amount of data: about 10 million documents
> making up about 300 GB. Is there any technical limitation in Lucene that
> could prevent me from processing such an amount (I mean, of course, apart
> from the external limits imposed by the hardware: RAM, disks, the system,
> whatever)?

Lucene is in theory able to support up to about 2 billion documents in a
single index.  Folks have successfully built indexes with several hundred
million documents.  10 million should not be a problem.
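
To put the numbers side by side: the roughly 2 billion ceiling matches the
range of a signed 32-bit Java int, which is presumably where it comes from,
since Lucene numbers documents with ints. The sketch below only does that
arithmetic; the class and method names are made up for illustration and are
not part of Lucene.

```java
public class DocIdHeadroom {
    // Fraction of the positive int range that a given document count uses.
    // Assumption: the practical ceiling is Integer.MAX_VALUE documents.
    static double fractionOfIntRange(long docCount) {
        return (double) docCount / Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long docCount = 10_000_000L; // figure from the original question
        System.out.println("Ceiling (Integer.MAX_VALUE): " + Integer.MAX_VALUE);
        System.out.println("Fraction used by 10M docs:   " + fractionOfIntRange(docCount));
    }
}
```

At 10 million documents that is well under 1% of the range, so the document
count itself leaves plenty of room.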

> If it is possible, does anyone have an idea of the resources
> needed: RAM, CPU time, size of indexes, access time on such a collection?
> If not, is it possible to extrapolate an estimate from previous
> benchmarks?

For simple two- or three-term queries over average-sized documents (~10 KB
of text), you should get decent performance (about 1 second per query) on a
10M-document index.  An index typically requires around 35% of the plain
text size.
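
Applied to the figures in the question, the 35% rule of thumb works out as
in the back-of-envelope sketch below. It assumes the whole 300 GB is plain
text (if the collection contains markup or binary formats, the extracted
text, and so the index, would be smaller) and it uses decimal gigabytes.
The class and method names are hypothetical, not a Lucene API.

```java
public class IndexSizeEstimate {
    // Rule of thumb quoted in this thread: index is about 35% of plain text size.
    static final double INDEX_TO_TEXT_RATIO = 0.35;

    // Estimated index size in bytes for a given amount of plain text.
    static long estimateIndexBytes(long plainTextBytes, double ratio) {
        return Math.round(plainTextBytes * ratio);
    }

    // Average document size in bytes (integer division).
    static long averageDocBytes(long totalBytes, long docCount) {
        return totalBytes / docCount;
    }

    public static void main(String[] args) {
        long totalBytes = 300L * 1_000_000_000L; // 300 GB, assumed to be all plain text
        long docCount = 10_000_000L;             // 10 million documents

        long indexBytes = estimateIndexBytes(totalBytes, INDEX_TO_TEXT_RATIO);
        System.out.println("Estimated index size (GB): " + indexBytes / 1_000_000_000L);
        System.out.println("Average document size (bytes): " + averageDocBytes(totalBytes, docCount));
    }
}
```

Under those assumptions the index comes to roughly 105 GB, and the average
document is about 30 KB, around three times the ~10 KB assumed for the query
time figure, so query times on this collection may differ from that estimate.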

Doug
