Posted to solr-user@lucene.apache.org by Steven White <sw...@gmail.com> on 2016/07/06 17:27:27 UTC

Full re-index without downtime

Hi everyone,

In my environment, I have use cases where I need to fully re-index my
data.  This happens because Solr's schema has to change whenever my data
source, the DB, changes.  For example, my DB schema may change so that a
whole set of fields is added to or removed from records, or a field's data
type changes.  When that happens, the only solution I have right now is to
drop the current Solr index, update Solr's schema.xml, and re-index my
data (I use Solr's Core Admin API to do all of this dynamically).

The issue with my current solution is that during the re-indexing, which
right now takes 10 hours (and I expect it to take over 30 hours as my data
keeps growing), search via Solr is not available.  Sure, I can enable
search while the data is being re-indexed, but then I get partial results.

My question is this: how can I avoid this so there is minimal downtime,
under 1 minute?  I was thinking of creating a second core (again
dynamically), re-indexing into it (after setting up the new schema), and,
once the re-index is fully done, switching over to the new core, dropping
the old core's index, deleting the old core, and renaming the new core to
the original core's name.
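
To make it concrete, here is roughly what I had in mind, using the
CoreAdmin API (Python and the requests library are just for illustration;
the host, core names, and instance dir are placeholders for whatever the
actual setup uses):

    import requests

    SOLR = "http://localhost:8983/solr"  # placeholder host/port

    # 1. Create a new core whose instanceDir already holds the updated
    #    schema.xml/solrconfig.xml.
    requests.get(f"{SOLR}/admin/cores", params={
        "action": "CREATE",
        "name": "mydata_new",
        "instanceDir": "mydata_new",
    }).raise_for_status()

    # 2. ...re-index every document into mydata_new here...

    # 3. Atomically swap the two core names so queries hit the new index.
    requests.get(f"{SOLR}/admin/cores", params={
        "action": "SWAP",
        "core": "mydata",
        "other": "mydata_new",
    }).raise_for_status()

    # 4. Unload the old index (after the swap, "mydata_new" names the old
    #    one) and delete its files.
    requests.get(f"{SOLR}/admin/cores", params={
        "action": "UNLOAD",
        "core": "mydata_new",
        "deleteIndex": "true",
    }).raise_for_status()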

Would the above work or is there a better way to do this?  How do you guys
solve this problem?

Again, my goal is to minimize downtime when Solr's schema changes
drastically enough to require a full re-index.

Thanks in advance.

Steve

Re: Full re-index without downtime

Posted by Jeff Wartes <jw...@whitepages.com>.
A variation on Steven Bower's option #1 below: use the same cluster and create a new collection, but use the createNodeSet option to logically partition the cluster so that no node hosts both the old and the new collection.

If your clients all reference a collection alias instead of a collection name, then all you need to do when the replacement index is ready is move the alias (instant and atomic) and then clean up by dropping the old collection. Repeat as necessary.
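
For reference, the whole flow is roughly this (Python and the requests
library are purely illustrative; collection names, node names, shard
counts, and the config name are placeholders):

    import requests

    SOLR = "http://localhost:8983/solr"  # placeholder host/port

    # Build the replacement collection only on nodes that don't host the
    # live one, so indexing load can't hurt query latency.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "CREATE",
        "name": "mydata_v2",
        "numShards": 2,
        "replicationFactor": 1,
        "collection.configName": "mydata_v2_conf",  # new schema, uploaded first
        "createNodeSet": "node3:8983_solr,node4:8983_solr",
    }).raise_for_status()

    # ...re-index into mydata_v2 and verify it...

    # Repoint the alias the clients query; this is instant and atomic.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "CREATEALIAS",
        "name": "mydata",           # the alias, not a collection name
        "collections": "mydata_v2",
    }).raise_for_status()

    # Clean up the old collection once nothing references it.
    requests.get(f"{SOLR}/admin/collections", params={
        "action": "DELETE",
        "name": "mydata_v1",
    }).raise_for_status()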

You say you’re using the CoreAdmin API though, which implies you’re not running SolrCloud; SolrCloud is a requirement for the above.




Re: Full re-index without downtime

Posted by Steven Bower <sb...@alcyon.net>.
There are two options as I see it:

1. Do something like you describe: create a secondary index, index into
it, then switch. I personally would create a completely separate SolrCloud
cluster alongside my existing one rather than a new core in the same
cluster, as you might see some negative GC impact caused by the indexing
load.

2. Tag each record with a field (e.g. "generation") that identifies which
generation of data a record is from, and when querying, filter on only the
generation of data that is complete; new records get a new generation. The
only problem with this is that changing field types doesn't really work
with the same field names, but if you use dynamic fields instead of static
ones the field name changes anyway, so that isn't a problem. A quick
sketch is below.
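
As a quick sketch of #2 (Python and the requests library are just
illustrative, and the dynamic-field suffixes and generation numbering are
placeholders for whatever scheme you pick):

    import requests

    SOLR = "http://localhost:8983/solr/mydata"  # placeholder core/collection
    GEN = 42  # bump this on every full re-index

    # Index: every document carries the generation it was loaded under.
    # Dynamic-field suffixes (*_s, *_i) sidestep the type-change problem.
    doc = {"id": "1", "title_s": "widget", "generation_i": GEN}
    requests.post(f"{SOLR}/update?commit=true", json=[doc]).raise_for_status()

    # Query: filter to the newest generation that finished loading, so a
    # half-built generation is never visible to searchers.
    resp = requests.get(f"{SOLR}/select", params={
        "q": "title_s:widget",
        "fq": f"generation_i:{GEN}",
    })
    resp.raise_for_status()
    print(resp.json()["response"]["numFound"])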

We use both of these patterns in different applications.

steve

>