Posted to solr-user@lucene.apache.org by "George P. Stathis" <gs...@traackr.com> on 2011/05/10 04:13:48 UTC

Solr approaches to re-indexing large document corpus

We are looking for some recommendations around systematically re-indexing in
Solr an ever-growing corpus of documents (tens of millions now, hundreds of
millions in less than a year) without taking the currently running index down.
Re-indexing is needed on a periodic basis because:

   - New features are introduced around searching the existing corpus that
   require additional schema fields which we can't always anticipate in advance
   - The corpus is indexed across multiple shards. When it grows past a
   certain threshold, we need to create more shards and re-balance documents
   evenly across all of them (which SolrCloud does not yet seem to support).
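
To illustrate why adding shards forces a re-index, here is a minimal sketch
that assumes documents are assigned to shards by hashing the document id
modulo the shard count. That routing scheme is an assumption made for
illustration, not something stated above, and the names shard_for and
fraction_moved are hypothetical.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index with a stable hash (assumed routing scheme)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def fraction_moved(doc_ids, old_count: int, new_count: int) -> float:
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for doc_id in doc_ids
        if shard_for(doc_id, old_count) != shard_for(doc_id, new_count)
    )
    return moved / len(doc_ids)


if __name__ == "__main__":
    ids = [f"doc-{i}" for i in range(10000)]
    # Going from 4 to 5 shards relocates most documents under this scheme.
    print(fraction_moved(ids, 4, 5))
```

Under this kind of routing, most documents land on a different shard once the
count changes, so re-balancing amounts to re-indexing nearly the whole corpus.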

The current index receives very frequent updates and additions, which need
to be available for search within minutes. Therefore, approaches where the
corpus is re-indexed in an offline batch don't really work: by the time the
batch finishes, new documents will have arrived that the rebuilt index lacks.

The approaches we are looking into at the moment are:

   - Create a new cluster of shards and batch re-index there while the old
   cluster is still available for searching. New documents that are not part of
   the re-indexed batch are sent to both the old cluster and the new cluster.
   When ready to switch, point the load balancer to the new cluster.
   - Use CoreAdmin: spawn a new core per shard and send the re-indexed batch
   to the new cores. New documents that are not part of the re-indexed batch
   are sent to both the old cores and the new cores. When ready to switch, use
   CoreAdmin to dynamically swap cores.
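
Both approaches share two steps: fanning live updates out to the old and new
indexes during the rebuild, then switching over. A rough sketch of those two
steps follows. The function names, URLs and core names are hypothetical;
the only Solr-specific piece relied on is the CoreAdmin SWAP action with its
core and other parameters.

```python
from urllib.parse import urlencode


def update_targets(old_url: str, new_url: str, rebuilding: bool) -> list:
    """While the rebuild runs, every live update goes to both indexes."""
    return [old_url, new_url] if rebuilding else [old_url]


def swap_url(solr_base: str, live_core: str, rebuilt_core: str) -> str:
    """Build the CoreAdmin SWAP request that exchanges the two cores."""
    query = urlencode({"action": "SWAP", "core": live_core, "other": rebuilt_core})
    return f"{solr_base}/admin/cores?{query}"


if __name__ == "__main__":
    # Hypothetical host and core names, for illustration only.
    base = "http://localhost:8983/solr"
    print(update_targets(f"{base}/shard1/update", f"{base}/shard1_rebuild/update", True))
    print(swap_url(base, "shard1", "shard1_rebuild"))
```

In the load-balancer variant the swap step would be replaced by repointing the
balancer at the new cluster; the dual-write step stays the same.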

We'd appreciate it if folks could confirm or poke holes in either or both of
these approaches. Is one more appropriate than the other? Or are we
completely off? Thank you in advance.