Posted to solr-user@lucene.apache.org by "Kelly, Frank" <fr...@here.com> on 2016/06/21 19:33:49 UTC

SolrCloud: Adding a very large collection to a pre-existing cluster

We have about 200 million documents (~70 GB) we need to keep indexed across 3 collections.

Currently 2 of the 3 collections are already indexed (roughly 90m docs).

We'd like to create the remaining collection (about 100m documents) while minimizing the performance impact on the existing collections on the Solr servers during that time.

Is there some way to do this by either

  1.  Creating the collection in another environment and shipping the (underlying Lucene) index files, or
  2.  Creating the collection on (dedicated) new machines that we add to the SolrCloud cluster?

Thoughts, comments or suggestions appreciated,

Best

-Frank Kelly

Re: SolrCloud: Adding a very large collection to a pre-existing cluster

Posted by Erick Erickson <er...@gmail.com>.

One other option is to index "somewhere else", then use the Collections API
ADDREPLICA command to add replicas on your prod cluster. Then perhaps delete
the replicas on the nodes that are "somewhere else".
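
A rough sketch of the two Collections API calls involved, in Python. It assumes the "somewhere else" nodes have joined the same SolrCloud cluster as prod. The host names, the collection name, the shard name and the replica name "core_node1" are all made up for illustration. The script only builds the URLs and does not send anything to Solr.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration; substitute your own.
SOLR = "http://prod-solr-1:8983/solr"   # any node in the cluster
COLLECTION = "bigcollection"


def collections_api_url(action, **params):
    """Build a Collections API URL. Nothing is sent to Solr here."""
    query = urlencode({"action": action, "collection": COLLECTION, **params})
    return f"{SOLR}/admin/collections?{query}"


# Step 1: add a replica of shard1 on a prod node; it copies the index
# from the shard leader.
add_url = collections_api_url(
    "ADDREPLICA", shard="shard1", node="prod-solr-1:8983_solr"
)

# Step 2: once the new replica is active, remove the copy that lives on
# the indexing machine. "core_node1" stands in for the real replica name,
# which CLUSTERSTATUS reports.
delete_url = collections_api_url(
    "DELETEREPLICA", shard="shard1", replica="core_node1"
)

print(add_url)
print(delete_url)
```

Repeat both calls per shard. It is probably wise to wait until the new replica shows as active before issuing the delete.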

Best,
Erick
On Jun 21, 2016 4:27 PM, "Jeff Wartes" <jw...@whitepages.com> wrote:

There’s no official way of doing #1, but there are some less official ways:
1. The Backup/Restore API provides some hooks into loading pre-existing
data dirs into an existing collection. Lots of caveats.
2. If you don’t have many shards, there’s always rsync/reload.
3. There are some third-party tools that help with this kind of thing:
   a. https://github.com/whitepages/solrcloud_manager (primarily a command line tool)
   b. https://github.com/bloomreach/solrcloud-haft (primarily a library)

For #2, absolutely. Spin up some new nodes in your cluster, and then use
the “createNodeSet” parameter when creating the new collection to restrict
to those new nodes:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1

Re: SolrCloud: Adding a very large collection to a pre-existing cluster

Posted by Jeff Wartes <jw...@whitepages.com>.

There’s no official way of doing #1, but there are some less official ways:
1. The Backup/Restore API provides some hooks into loading pre-existing data dirs into an existing collection. Lots of caveats.
2. If you don’t have many shards, there’s always rsync/reload.
3. There are some third-party tools that help with this kind of thing:
   a. https://github.com/whitepages/solrcloud_manager (primarily a command line tool)
   b. https://github.com/bloomreach/solrcloud-haft (primarily a library)

For #2, absolutely. Spin up some new nodes in your cluster, and then use the “createNodeSet” parameter when creating the new collection to restrict it to those new nodes:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
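
As a minimal sketch, the CREATE call with createNodeSet could be put together like this in Python. The host names, collection name, config name, shard count and replication factor are placeholders, not values from this thread. Node names use the host:port_solr form that the cluster lists under live_nodes. The script only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# All values below are hypothetical placeholders.
SOLR = "http://prod-solr-1:8983/solr"  # any node already in the cluster
new_nodes = ["new-solr-1:8983_solr", "new-solr-2:8983_solr"]

params = {
    "action": "CREATE",
    "name": "bigcollection",
    "numShards": 2,
    "replicationFactor": 1,
    "collection.configName": "bigcollection_conf",
    # Comma-separated node names; replicas are placed only on these nodes.
    "createNodeSet": ",".join(new_nodes),
}

create_url = f"{SOLR}/admin/collections?{urlencode(params)}"
print(create_url)
```

With the collection confined to the new nodes, the bulk indexing load should land on those machines rather than on the ones serving the existing two collections.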
