Posted to java-user@lucene.apache.org by Jeff Munson <jm...@newspaperarchive.com> on 2004/10/08 14:43:30 UTC

Indexing Strategy for 20 million documents

I am a new user of Lucene.  I am looking to index over 20 million
documents (and a lot more someday) and am looking for ideas on the best
indexing/search strategy.  

Which will optimize the Lucene search, one index or multiple indexes?
Do I create multiple indexes and merge them all together?  Or do I
create multiple indexes and search on the multiple indexes?  

Any helpful ideas would be appreciated!

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Indexing Strategy for 20 million documents

Posted by Justin Swanhart <gr...@gmail.com>.
It depends on a lot of factors.  I myself use multiple indexes for
about 10M documents.  My documents are transient.  Each day I get
about 400K and I remove about 400K.  I always remove an entire day's
documents at one time.  It is much faster/easier to delete the Lucene
index for the day that I am removing than to loop through one big
index and remove the entries with the IndexReader.  Since my data is
also partitioned by day in my database, I essentially do the same
thing there with "truncate table."
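
The one-index-per-day layout described above can be sketched in plain
Java, without the Lucene API.  The class name DailyIndexDirs, the
directory naming (ISO dates under a common root), and every method name
here are assumptions made for illustration.  The only point is that
retiring a day becomes a directory delete rather than a
document-by-document removal.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: one index directory per day under a common root. */
class DailyIndexDirs {
    private final Path root;

    DailyIndexDirs(Path root) {
        this.root = root;
    }

    /** Directory assumed to hold the index for a day, e.g. root/2004-10-08. */
    Path dirFor(LocalDate day) {
        return root.resolve(day.toString());
    }

    /** Drops a whole day by deleting its directory tree; false if it is absent. */
    boolean dropDay(LocalDate day) throws IOException {
        Path dir = dirFor(day);
        if (!Files.isDirectory(dir)) {
            return false;
        }
        List<Path> paths = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            walk.sorted(Comparator.reverseOrder()).forEach(paths::add);
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    /** Days that currently have a directory, oldest first.
     *  Assumes every subdirectory of root is named with an ISO date. */
    List<LocalDate> liveDays() throws IOException {
        List<LocalDate> days = new ArrayList<>();
        try (Stream<Path> children = Files.list(root)) {
            children.filter(Files::isDirectory)
                    .forEach(p -> days.add(LocalDate.parse(p.getFileName().toString())));
        }
        days.sort(Comparator.naturalOrder());
        return days;
    }
}
```

In a real deployment one would presumably close any searcher still open
on a day's directory before calling dropDay, and then rebuild the list
of searchable indexes from liveDays.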

I use a ParallelMultiSearcher object to search the indexes.  I store
my indexes on a 14-disk, 15k rpm Fibre Channel RAID 1+0 array
(striped mirrors).
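
For readers unfamiliar with the idea, searching several indexes in
parallel amounts to fanning the query out and merging the per-index
hits.  The sketch below shows that pattern in plain Java.  It is not
the Lucene class itself: FanOutSearch, ShardSearcher, and Hit are
hypothetical names, and it assumes scores from different indexes are
directly comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FanOutSearch {
    /** One scored hit; 'shard' records which index produced it. */
    static final class Hit {
        final int shard;
        final int docId;
        final float score;

        Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a searcher bound to a single index. */
    interface ShardSearcher {
        List<Hit> search(String query, int topN);
    }

    /** Queries every shard on its own thread, then keeps the best topN hits overall. */
    static List<Hit> searchAll(List<ShardSearcher> shards, String query, int topN)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (ShardSearcher s : shards) {
                pending.add(pool.submit(() -> s.search(query, topN)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pending) {
                merged.addAll(f.get()); // blocks until that shard has answered
            }
            // Highest score first.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

Asking each shard for topN hits is enough to build a correct global
topN, since no hit outside a shard's own topN can rank higher overall.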

I get very good performance in both updating and searching indexes.

On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Jeff,
> 
> These questions are difficult to answer, because the answer depends on
> a number of factors, such as:
> - hardware (memory, disk speed, number of disks...)
> - index complexity and size (number of fields and their size)
> - number of queries/second
> - complexity of queries
> etc.
> 
> I would try putting everything in a single index first, and split it up
> only if I see performance issues.  Going from 1 index to N indices is
> not a lot of work (not a lot of Lucene-related code).  If searching 1
> big index is too slow, split your index, put each index on a separate
> disk, and use ParallelMultiSearcher
> (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
> to search your indices.
> 
> Otis
> 
> 
> 
> 
> --- Jeff Munson <jm...@newspaperarchive.com> wrote:
> 
> > I am a new user of Lucene.  I am looking to index over 20 million
> > documents (and a lot more someday) and am looking for ideas on the best
> > indexing/search strategy.
> >
> > Which will optimize the Lucene search, one index or multiple indexes?
> > Do I create multiple indexes and merge them all together?  Or do I
> > create multiple indexes and search on the multiple indexes?
> >
> > Any helpful ideas would be appreciated!


Re: Indexing Strategy for 20 million documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Christoph Kiehl <ki...@subshell.com> wrote:

> Otis Gospodnetic wrote:
> 
> > I would try putting everything in a single index first, and split it up
> > only if I see performance issues.
> 
> Why would you put everything into a single index? I found some benchmark
> results on the list (starting with your post from 06/08/04) from which I
> got the impression that the performance loss is very small if I choose
> to search in multiple indexes with MultiSearcher instead of using one
> big index.

I think it's simpler to deal with a single index.  One directory, one
set of lock files, etc.  If you don't gain anything by having multiple
indices, why have them?

> > Going from 1 index to N indices is
> > not a lot of work (not a lot of Lucene-related code). 
> 
> How do you get from 1 index to N indices without adding the documents
> again?

Yes, you would have to add the documents again and re-create N Lucene
indices.
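
When re-creating N indices from the source documents, each document
has to be assigned to exactly one of them.  One possible way to do
that, sketched here in plain Java, is a stable hash of a document key.
ShardRouter and its methods are hypothetical names, and hash routing is
only an assumption; partitioning by date or by source would work too.

```java
import java.util.ArrayList;
import java.util.List;

class ShardRouter {
    /** Maps a document key to a shard number in [0, shardCount). */
    static int shardFor(String docKey, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docKey.hashCode(), shardCount);
    }

    /** Groups keys by target shard; each inner list would feed one index build. */
    static List<List<String>> partition(List<String> docKeys, int shardCount) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ArrayList<>());
        }
        for (String key : docKeys) {
            shards.get(shardFor(key, shardCount)).add(key);
        }
        return shards;
    }
}
```

Because the mapping depends only on the key and the shard count, the
same routine can later tell you which index to update or delete from
for a given document.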

Otis




Re: Indexing Strategy for 20 million documents

Posted by Christoph Kiehl <ki...@subshell.com>.
Otis Gospodnetic wrote:

> I would try putting everything in a single index first, and split it up
> only if I see performance issues.  

Why would you put everything into a single index? I found some benchmark 
results on the list (starting with your post from 06/08/04) from which I 
got the impression that the performance loss is very small if I choose 
to search in multiple indexes with MultiSearcher instead of using one 
big index.

> Going from 1 index to N indices is
> not a lot of work (not a lot of Lucene-related code). 

How do you get from 1 index to N indices without adding the documents again?

Thanks,
Christoph




Re: Indexing Strategy for 20 million documents

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Jeff,

These questions are difficult to answer, because the answer depends on
a number of factors, such as:
- hardware (memory, disk speed, number of disks...)
- index complexity and size (number of fields and their size)
- number of queries/second
- complexity of queries
etc.

I would try putting everything in a single index first, and split it up
only if I see performance issues.  Going from 1 index to N indices is
not a lot of work (not a lot of Lucene-related code).  If searching 1
big index is too slow, split your index, put each index on a separate
disk, and use ParallelMultiSearcher
(http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
to search your indices.

Otis


--- Jeff Munson <jm...@newspaperarchive.com> wrote:

> I am a new user of Lucene.  I am looking to index over 20 million
> documents (and a lot more someday) and am looking for ideas on the best
> indexing/search strategy.
> 
> Which will optimize the Lucene search, one index or multiple indexes?
> Do I create multiple indexes and merge them all together?  Or do I
> create multiple indexes and search on the multiple indexes?  
> 
> Any helpful ideas would be appreciated!