Posted to java-user@lucene.apache.org by Jeff Munson <jm...@newspaperarchive.com> on 2004/10/08 14:43:30 UTC
Indexing Strategy for 20 million documents
I am a new user of Lucene. I am looking to index over 20 million
documents (and a lot more someday) and am looking for ideas on the best
indexing/search strategy.
Which will optimize the Lucene search, one index or multiple indexes?
Do I create multiple indexes and merge them all together? Or do I
create multiple indexes and search on the multiple indexes?
Any helpful ideas would be appreciated!
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: Indexing Strategy for 20 million documents
Posted by Justin Swanhart <gr...@gmail.com>.
It depends on a lot of factors. I myself use multiple indexes for
about 10M documents.
My documents are transient. Each day I get about 400K and I remove
about 400K. I always remove an entire day's documents at one time. It
is much faster and easier to delete the Lucene index for the day that I
am removing than to loop through one big index and remove the entries
with the IndexReader. Since my data is also partitioned by day in my
database, I essentially do the same thing there with "truncate table."
I use a ParallelMultiSearcher object to search the indexes. I store my
indexes on a 14-disk, 15k rpm Fibre Channel RAID 1+0 array (striped
mirrors).
I get very good performance in both updating and searching indexes.
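A minimal sketch of the one-index-per-day layout described above, using
only the Java standard library. The class name DailyIndexDirs, the
root/YYYY-MM-DD directory naming, and both method names are assumptions
made for illustration; none of this is Lucene API. Dropping a day is
then just a recursive delete of that day's directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: one index directory per day under a common root. */
class DailyIndexDirs {

    /** Directory holding the index for one day, e.g. root/2004-10-08. */
    static Path dirFor(Path root, LocalDate day) {
        return root.resolve(day.toString()); // LocalDate prints as ISO-8601
    }

    /** Deletes the whole directory tree for a day; returns false if it is absent. */
    static boolean dropDay(Path root, LocalDate day) throws IOException {
        Path dir = dirFor(root, day);
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order lists children before their parent directory.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }
}
```

This assumes nothing still has the day's index open when it is dropped;
any searcher on that directory would need to be closed first.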
On Fri, 8 Oct 2004 06:11:37 -0700 (PDT), Otis Gospodnetic
<ot...@yahoo.com> wrote:
> Jeff,
>
> These questions are difficult to answer, because the answer depends on
> a number of factors, such as:
> - hardware (memory, disk speed, number of disks...)
> - index complexity and size (number of fields and their size)
> - number of queries/second
> - complexity of queries
> etc.
>
> I would try putting everything in a single index first, and split it up
> only if I see performance issues. Going from 1 index to N indices is
> not a lot of work (not a lot of Lucene-related code). If searching 1
> big index is too slow, split your index, put each index on a separate
> disk, and use ParallelMultiSearcher
> (http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
> to search your indices.
>
> Otis
>
> --- Jeff Munson <jm...@newspaperarchive.com> wrote:
>
> > I am a new user of Lucene. I am looking to index over 20 million
> > documents (and a lot more someday) and am looking for ideas on the
> > best indexing/search strategy.
> >
> > Which will optimize the Lucene search, one index or multiple indexes?
> > Do I create multiple indexes and merge them all together? Or do I
> > create multiple indexes and search on the multiple indexes?
> >
> > Any helpful ideas would be appreciated!
Re: Indexing Strategy for 20 million documents
Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Christoph Kiehl <ki...@subshell.com> wrote:
> Otis Gospodnetic wrote:
>
> > I would try putting everything in a single index first, and split
> > it up only if I see performance issues.
>
> Why would you put everything into a single index? I found some
> benchmark results on the list (starting with your post from
> 06/08/04) from which I got the impression that the performance loss
> is very small if I choose to search in multiple indexes with
> MultiSearcher instead of using one big index.
I think it's simpler to deal with a single index. One directory, one
set of lock files, etc. If you don't gain anything by having multiple
indices, why have them?
> > Going from 1 index to N indices is
> > not a lot of work (not a lot of Lucene-related code).
>
> How do you get from 1 index to N indices without adding the documents
> again?
Yes, you would have to re-create N Lucene indices.
Otis
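As a rough illustration of what re-creating N indices involves: each
document has to be re-added to exactly one of the N new indices, so
some routing rule is needed. The sketch below assumes routing by a hash
of a document key; the class ShardRouter, its method names, and the
hash-based rule are all hypothetical and are not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical router that assigns each document key to one of N indices. */
class ShardRouter {

    /** Stable index number in [0, numShards) for a document key. */
    static int shardFor(String docKey, int numShards) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docKey.hashCode(), numShards);
    }

    /** Groups keys by target index; bucket i holds the keys to re-add to index i. */
    static List<List<String>> partition(List<String> docKeys, int numShards) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : docKeys) {
            buckets.get(shardFor(key, numShards)).add(key);
        }
        return buckets;
    }
}
```

Routing by date, as in the per-day setup mentioned earlier in the
thread, would be an equally valid rule; the hash is only one option.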
Re: Indexing Strategy for 20 million documents
Posted by Christoph Kiehl <ki...@subshell.com>.
Otis Gospodnetic wrote:
> I would try putting everything in a single index first, and split it up
> only if I see performance issues.
Why would you put everything into a single index? I found some benchmark
results on the list (starting with your post from 06/08/04) from which I
got the impression that the performance loss is very small if I choose
to search in multiple indexes with MultiSearcher instead of using one
big index.
> Going from 1 index to N indices is
> not a lot of work (not a lot of Lucene-related code).
How do you get from 1 index to N indices without adding the documents again?
Thanks,
Christoph
Re: Indexing Strategy for 20 million documents
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Jeff,
These questions are difficult to answer, because the answer depends on
a number of factors, such as:
- hardware (memory, disk speed, number of disks...)
- index complexity and size (number of fields and their size)
- number of queries/second
- complexity of queries
etc.
I would try putting everything in a single index first, and split it up
only if I see performance issues. Going from 1 index to N indices is
not a lot of work (not a lot of Lucene-related code). If searching 1
big index is too slow, split your index, put each index on a separate
disk, and use ParallelMultiSearcher
(http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/ParallelMultiSearcher.html)
to search your indices.
Otis
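To make the split-and-search-in-parallel idea concrete, here is a
sketch of the general pattern in plain Java: query every index on its
own thread, then merge the per-index hits by score. It is not the
ParallelMultiSearcher implementation and uses no Lucene classes; the
Shard interface, the Hit class, and searchAll are hypothetical
stand-ins for a per-index searcher and its results.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ShardedSearch {

    /** A scored hit; docId is only meaningful within its own shard. */
    static final class Hit {
        final int shard;
        final int docId;
        final float score;

        Hit(int shard, int docId, float score) {
            this.shard = shard;
            this.docId = docId;
            this.score = score;
        }
    }

    /** Hypothetical stand-in for a searcher over one index. */
    interface Shard {
        List<Hit> search(String query, int topN);
    }

    /** Queries all shards concurrently and returns the topN hits by score. */
    static List<Hit> searchAll(List<Shard> shards, String query, int topN)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.size()));
        try {
            List<Future<List<Hit>>> pending = new ArrayList<>();
            for (Shard s : shards) {
                pending.add(pool.submit(() -> s.search(query, topN)));
            }
            List<Hit> merged = new ArrayList<>();
            for (Future<List<Hit>> f : pending) {
                merged.addAll(f.get()); // blocks until that shard has answered
            }
            // Highest score first, then keep only the global top N.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        } finally {
            pool.shutdown();
        }
    }
}
```

The merge step assumes scores from different indices are comparable,
which is a simplification made for this sketch.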
--- Jeff Munson <jm...@newspaperarchive.com> wrote:
> I am a new user of Lucene. I am looking to index over 20 million
> documents (and a lot more someday) and am looking for ideas on the
> best indexing/search strategy.
>
> Which will optimize the Lucene search, one index or multiple indexes?
> Do I create multiple indexes and merge them all together? Or do I
> create multiple indexes and search on the multiple indexes?
>
> Any helpful ideas would be appreciated!