Posted to solr-user@lucene.apache.org by Hui Liu <hl...@opentext.com> on 2016/06/09 15:50:11 UTC

Questions regarding re-index when using Solr as a data source

Hi,

We are porting an application currently hosted in Oracle 11g to Solr Cloud 6.x. That is, we plan to migrate all tables in Oracle to collections in Solr, index them, and build search tools on top of this; the goal is that we won't be using Oracle at all once this has been implemented. Every field in Solr will have 'stored=true', and a selected subset of searchable fields will have 'indexed=true'.

The question is what steps we should follow if we need to re-index a collection after making schema changes. Mostly we only add new stored fields, or make a non-indexed field indexed; we normally do not delete or rename existing fields. According to https://wiki.apache.org/solr/HowToReindex it seems we need to set up an 'intermediate' Solr1 that only stores the data without any indexing, then a second Solr2 that stores the indexed data; to re-index, we just delete all the documents in Solr2 for the collection and re-import the data from Solr1 into Solr2 using SolrEntityProcessor (from the DataImportHandler). Is this still the recommended approach?

The downside I can see is that some of our collections could have several billion documents, so re-importing one from Solr1 to Solr2 may take hours or even days, and during that time users cannot query the data. Is there a better way to do this and avoid that kind of downtime? Any feedback is appreciated!

Regards,
Hui Liu
Opentext, Inc.
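
For reference, the Solr1-to-Solr2 re-import described above is configured on the Solr2 side through a DataImportHandler data-config.xml that uses SolrEntityProcessor. A minimal sketch, with hypothetical host and collection names:

    <!-- data-config.xml on Solr2; host and collection names are hypothetical -->
    <dataConfig>
      <document>
        <entity name="solr1"
                processor="SolrEntityProcessor"
                url="http://solr1.example.com:8983/solr/mycollection"
                query="*:*"
                rows="1000"
                fl="*"/>
      </document>
    </dataConfig>

Triggering /dataimport?command=full-import on Solr2 then clears the collection first (clean defaults to true) and re-pulls every stored document from Solr1, which is the delete-and-re-import cycle described above.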

RE: Questions regarding re-index when using Solr as a data source

Posted by Hui Liu <hl...@opentext.com>.
Thank you Walter.


Re: Questions regarding re-index when using Solr as a data source

Posted by Walter Underwood <wu...@wunderwood.org> on Jun 10, 2016, 3:53 PM.
Those are brand new features that I have not used, so I can’t comment on them.

But I know they do not make Solr into a database.

If you need a transactional database that can support search, you probably want MarkLogic. I worked at MarkLogic for a couple of years. In some ways, MarkLogic is like Solr, but the support for transactions goes very deep. It is not something you can put on top of a search engine.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


RE: Questions regarding re-index when using Solr as a data source

Posted by Hui Liu <hl...@opentext.com> on Jun 10, 2016, 12:39 PM.
What if we plan to use Solr version 6.x? This page says it supports two different update modes, atomic update and optimistic concurrency:

https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

I tested 'optimistic concurrency' and it appears to work: if a document I am updating has been changed by someone else, I get an error when I supply the stale _version_ value. So maybe you are referring to an older version of Solr?

Regards,
Hui
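
To make the test concrete, the behavior described here can be reproduced against Solr's real-time get and atomic update APIs. A minimal Python sketch, assuming a hypothetical collection name and document id:

    import requests

    SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

    # Read the document's current _version_ via real-time get.
    doc = requests.get(SOLR + "/get", params={"id": "42", "wt": "json"}).json()["doc"]
    version = doc["_version_"]

    # Atomic update that succeeds only if the document is still at that
    # version; on a mismatch Solr rejects the update with 409 Conflict.
    update = [{"id": "42", "status": {"set": "processed"}, "_version_": version}]
    resp = requests.post(SOLR + "/update?commit=true", json=update)
    if resp.status_code == 409:
        print("Conflict: someone else changed the document; re-read and retry.")
    else:
        resp.raise_for_status()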


Re: Questions regarding re-index when using Solr as a data source

Posted by Walter Underwood <wu...@wunderwood.org> on Jun 10, 2016, 11:18 AM.
Solr does not have transactions at all. The “commit” is really “submit batch”.

Solr does not have update. You can add, delete, or replace an entire document.

There is no optimistic concurrency control because there is no concurrency control. Clients can concurrently add documents to a batch, then any client can submit the entire batch.

Replication is not transactional. Replication is a file copy of the underlying indexes (classic) or copying the documents in a batch (Solr Cloud).

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
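
A small sketch of that "submit batch" behavior, with a hypothetical collection; documents sent by any client sit in the same uncommitted batch until someone issues a commit:

    import requests

    UPDATE = "http://localhost:8983/solr/mycollection/update"  # hypothetical

    # Buffered adds: searchers cannot see these documents yet.
    requests.post(UPDATE, json=[{"id": "a"}, {"id": "b"}]).raise_for_status()

    # The "commit" publishes everything submitted so far, by any client,
    # as one batch; it is not a per-client transaction boundary.
    requests.get(UPDATE, params={"commit": "true"}).raise_for_status()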


RE: Questions regarding re-index when using Solr as a data source

Posted by Hui Liu <hl...@opentext.com> on Jun 10, 2016, 7:41 AM.
Walter,

	Thank you for your advice. We are new to Solr and have been using Oracle for the past 10+ years, so we are used to a tool that serves as both the data store and the search layer, with indexes built on top of it. The reason we are considering Solr as a data store is that it has some database features our application requires: 1) it can detect duplicate records via a unique key field; 2) it allows concurrent updates via its optimistic concurrency control feature; 3) its replication feature lets us store multiple copies of the data. If we were to use a file system instead, we would not have these features (at least not 1 and 2) and would have to implement them ourselves. The other option is to pick another database such as MySQL or Cassandra, but then we would need to learn and support an additional tool besides Solr. You brought up several very good points, though, about the operational factors we should consider if we pick Solr as a data store. Also, our application is more OLTP than OLAP. I will update our colleagues and stakeholders about these concerns. Thanks again!

Regards,
Hui
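
Point 1 maps to Solr's uniqueKey: adding a document whose key already exists replaces the old document rather than creating a duplicate. A minimal sketch, with a hypothetical collection and key field:

    import requests

    SOLR = "http://localhost:8983/solr/mycollection"  # hypothetical collection

    # With <uniqueKey>id</uniqueKey> in the schema, the second add below
    # overwrites the first instead of producing a duplicate record.
    docs = [{"id": "order-1", "status": "new"},
            {"id": "order-1", "status": "shipped"}]
    requests.post(SOLR + "/update?commit=true", json=docs).raise_for_status()

    result = requests.get(SOLR + "/select",
                          params={"q": "id:order-1", "wt": "json"}).json()
    print(result["response"]["numFound"])  # 1 -- only the latest version remains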


Re: Questions regarding re-index when using Solr as a data source

Posted by Walter Underwood <wu...@wunderwood.org> on Jun 9, 2016, 1:24 PM.
In the HowToReindex page, under “Using Solr as a Data Store”, it says this: “Don't do this unless you have no other option. Solr is not really designed for this role.” So don’t start by planning to do this.

Using a second copy of Solr is still using Solr as a repository. That doesn’t satisfy any sort of requirements for disaster recovery. How do you know that data is good? How do you make a third copy? How do you roll back to a previous version? How do you deal with a security breach that affects all your systems? Are the systems in the same data center? How do you deal with ransomware (U. of Calgary paid $20K yesterday)?

If a consultant suggested this to me, I’d probably just give up and get a different consultant.

Here is what we do for batch loading.

1. For each Solr collection, we define a JSONL feed format, with a JSON Schema.
2. The owners of the data write an extractor to pull the data out of wherever it is, then generate the JSON feed.
3. We validate the JSON feed against the JSON schema.
4. If the feed is valid, we save it to Amazon S3 along with a manifest which lists the version of the JSON Schema.
5. Then a multi-threaded loader reads the feed and sends it to Solr.

Reloading is safe and easy, because all the feeds in S3 are valid.

Storing backups in S3 instead of running a second Solr is massively cheaper, easier, and safer.

We also have a clear contract between the content owners and the search team. That contract is enforced by the JSON Schema on every single batch.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)
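
A minimal Python sketch of steps 3 and 5 of this pipeline, assuming hypothetical feed, schema, and collection names, and the widely used jsonschema package for validation:

    import json
    import requests
    from concurrent.futures import ThreadPoolExecutor
    from jsonschema import validate

    UPDATE = "http://localhost:8983/solr/mycollection/update"  # hypothetical

    # Step 3: validate every JSONL line against the agreed JSON Schema.
    with open("feed-schema.json") as f:
        schema = json.load(f)
    with open("feed.jsonl") as feed:
        docs = [json.loads(line) for line in feed]
    for doc in docs:
        validate(instance=doc, schema=schema)  # raises ValidationError on a bad doc

    # Step 5: a multi-threaded loader sends the validated feed in batches.
    def send(batch):
        requests.post(UPDATE, json=batch).raise_for_status()

    batches = [docs[i:i + 1000] for i in range(0, len(docs), 1000)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(send, batches))

    requests.get(UPDATE, params={"commit": "true"}).raise_for_status()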


RE: Questions regarding re-index when using Solr as a data source

Posted by Hui Liu <hl...@opentext.com>.
Hi Walter,

Thank you for the reply. Sorry, I need to clarify what I meant by 'migrating tables' from Oracle to Solr: we are not literally moving existing records from Oracle to Solr. Instead, we are building a new application that feeds data directly into Solr as documents and fields, in parallel with an existing application that feeds the same data into Oracle tables/columns; of course, the Solr schema will be somewhat different from the Oracle one. Also, we only keep the data for 90 days for users to search on, so we hope that once we have run both systems in parallel for some time (> 90 days), we will have built up enough new data in Solr that we no longer need the old data in Oracle; at that point we will be able to use Solr as our only data store.

It sounds like we may need to consider saving the data into either a file system or another database in case we need to rebuild the indexes. The reason I mentioned saving the data into another Solr system is the passage below from https://wiki.apache.org/solr/HowToReindex - I am just trying to get feedback on whether there is any update on this approach, and whether there is a better way to minimize the downtime caused by a schema change and re-index. For example, in Oracle we are able to add a new column or a new index online without any impact on existing queries, since the existing indexes stay intact.

Alternatives when a traditional reindex isn't possible

Sometimes the option of "do your indexing again" is difficult. Perhaps the original data is very slow to access, or it may be difficult to get in the first place.

Here's where we go against our own advice that we just gave you. Above we said "don't use Solr itself as a datasource" ... but one way to deal with data availability problems is to set up a completely separate Solr instance (not distributed, which for SolrCloud means numShards=1) whose only job is to store the data, then use the SolrEntityProcessor in the DataImportHandler to index from that instance to your real Solr install. If you need to reindex, just run the import again on your real installation. Your schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, long, and string. It would not have any copyFields.

This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult. This way they can reindex the online Solr at any time without having to get special permission from all those entities. When they index new content, it goes into a copy of Solr configured for storage only, not in-depth searching. Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.
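
For reference, the copy step described above does not have to go through the DataImportHandler; it can also be scripted directly against Solr's HTTP API using cursorMark deep paging (available since Solr 4.7). A rough sketch in Python with the requests library - the host and collection names are placeholders:

# Copy every document from the store-only instance into the searchable one,
# walking the source with cursorMark so deep paging stays cheap.
import requests

SOURCE = "http://solr1:8983/solr/store_only"   # stored=true, indexed=false
TARGET = "http://solr2:8983/solr/searchable"   # the real, searchable schema

def reindex(rows=1000):
    cursor = "*"
    while True:
        resp = requests.get(SOURCE + "/select", params={
            "q": "*:*",
            "sort": "id asc",        # cursorMark requires a sort on uniqueKey
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        }).json()
        docs = resp["response"]["docs"]
        for doc in docs:
            doc.pop("_version_", None)  # let the target assign its own version
        if docs:
            requests.post(TARGET + "/update", json=docs,
                          params={"commit": "false"}).raise_for_status()
        if resp["nextCursorMark"] == cursor:  # cursor stops advancing at the end
            break
        cursor = resp["nextCursorMark"]
    requests.post(TARGET + "/update", params={"commit": "true"})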

Regards,
Hui




Re: Questions regarding re-index when using Solr as a data source

Posted by Walter Underwood <wu...@wunderwood.org>.
First, using Solr as a repository is pretty risky. I would keep the official copy of the data in a database, not in Solr.

Second, you can’t “migrate tables” because Solr doesn’t have tables. You need to turn the tables into documents, then index the documents. It can take a lot of joins to flatten a relational schema into Solr documents.

Solr does not support schema migration, so yes, you will need to save off all the documents, then reload them. I would save them to files. It makes no sense to put them in another copy of Solr.
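
For example, here is an illustrative sketch (table, field, and file names are all invented) of flattening a parent/child join into documents and spooling them to a file that can be replayed into Solr later:

# Flatten rows from a join such as
#   SELECT o.id, o.customer, i.sku, i.qty
#   FROM orders o JOIN order_items i ON i.order_id = o.id
#   ORDER BY o.id
# into one document per order, then write the documents out as JSON lines.
import json
from itertools import groupby
from operator import itemgetter

rows = [
    (1, "acme", "SKU-7", 2),
    (1, "acme", "SKU-9", 1),
    (2, "globex", "SKU-7", 5),
]

def flatten(rows):
    for order_id, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        yield {
            "id": str(order_id),
            "customer_s": group[0][1],
            "item_sku_ss": [r[2] for r in group],   # multivalued child fields
            "item_qty_is": [r[3] for r in group],
        }

with open("orders.jsonl", "w") as out:
    for doc in flatten(rows):
        out.write(json.dumps(doc) + "\n")

Reindexing after a schema change is then just a matter of reading the files back and POSTing the documents to /update.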

Changing the schema will be difficult and time-consuming, but you’ll probably run into much worse problems trying to use Solr as a repository.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

