Posted to commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2013/08/05 15:21:52 UTC

[Solr Wiki] Update of "HowToReindex" by ShawnHeisey

The "HowToReindex" page has been changed by ShawnHeisey:
https://wiki.apache.org/solr/HowToReindex?action=diff&rev1=5&rev2=6

Comment:
reorganized the page and improved the wording.  Nothing new.

  
  == Using Solr as a Data Source ==
  
- Don't use Solr itself as a primary data source unless you have no other option.  It's not really designed for this role.  Every attempt is made to ensure that Solr is stable, but indexes do get corrupted by unanticipated situations, and by things completely outside developer control.  Solr 4.x and later does have NoSQL features, and SolrCloud goes a long way towards high availability, but absolute data reliability in the face of any problem is difficult to achieve for any software, which is why it's always important to have backups.
+ Don't do this unless you have no other option.  Solr is not really designed for this role.  Every attempt is made to ensure that Solr is stable, but indexes do get corrupted by unanticipated situations, and by things completely outside developer control.  Solr 4.x and later does have NoSQL features, and SolrCloud goes a long way towards high availability, but absolute data reliability in the face of any problem is difficult to achieve for any software, which is why it's always important to have backups.
  
- <!> Using Solr as a data source to build a new index is only possible if every single field in your index (except copyField destinations) is stored.  If you have 'stored="false"' on required fields in your schema, you won't be able to recover that data.  It's simply not possible.
+ <!> Using Solr as a data source to build a new index is only possible if you have 'stored="true"' for every single field in your index except any copyField destinations.  If you have 'stored="false"' on required fields in your schema, you won't be able to recover that data.  It's simply not possible.
  
  If you absolutely must use one Solr index as the data source for another index, and you have stored every field, you have a few possible options.
  
@@ -27, +27 @@

    1. http://grokbase.com/t/lucene/solr-user/134p562kxs/export-index-and-re-index-xml
    1. http://www.jason-palmer.com/2011/05/how-to-reindex-a-solr-database/
  
- There is at least one large-scale Solr user that uses separate Solr instances as intermediate data stores because it's difficult to obtain the data from the original source for a reindex.  When they index new content, it goes into a copy of Solr configured for storage only, not in-depth searching.  Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.
- 
  == Alternatives when a traditional reindex isn't possible ==
  
- Sometimes the option of "do your indexing again" is difficult.  Perhaps the original data is very slow to access, or it may be difficult to get in the first place.  One way to deal with this is to set up an additional Solr instance or additional Solr core whose only job is to store the data in a non-distributed index, then use the SolrEntityProcessor in the DataImportHandler to index from that instance to your real Solr install.  If you need to reindex, just run the import again on your real installation.  Your schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, tint, and string.  It would not have any copyFields.
+ Sometimes the option of "do your indexing again" is difficult.  Perhaps the original data is very slow to access, or it may be difficult to get in the first place.
  
- This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult.  This way they can reindex the online Solr at any time without having to get special permission from all those entities.
+ This is where we go against the advice we just gave you.  Above we said "don't use Solr itself as a data source" ... but one way to deal with data availability problems is to set up a completely separate Solr instance (not distributed, which for SolrCloud means numShards=1) whose only job is to store the data, then use the SolrEntityProcessor in the DataImportHandler to index from that instance into your real Solr install.  If you need to reindex, just run the import again on your real installation.  The schema for the intermediate Solr install would have stored="true" and indexed="false" for all fields, and would only use basic types like int, long, and string.  It would not have any copyFields.
  
+ This is the approach used by the Smithsonian for their Solr installation, because getting access to the source databases for the individual entities within the organization is very difficult.  This way they can reindex the online Solr at any time without having to get special permission from all those entities.  When they index new content, it goes into a copy of Solr configured for storage only, not in-depth searching.  Their main Solr instance uses SolrEntityProcessor to import from the intermediate Solr servers, so they can always reindex.
+
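
The storage-only approach described above could be wired up roughly as follows. This is a minimal sketch, not the page author's configuration: it renders a DataImportHandler config that uses SolrEntityProcessor and builds the request URL for a full import. The entity name "storage", the host names, the core names, and the "/dataimport" handler path are all assumptions for illustration; the handler path depends on how the request handler is registered in your solrconfig.xml.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_data_config(source_url, query="*:*", rows=500):
    """Render a DIH config that pulls stored documents from another Solr.

    source_url is the (hypothetical) storage-only Solr instance.
    """
    root = ET.Element("dataConfig")
    document = ET.SubElement(root, "document")
    ET.SubElement(
        document,
        "entity",
        {
            "name": "storage",  # hypothetical entity name
            "processor": "SolrEntityProcessor",
            "url": source_url,
            "query": query,      # which documents to copy
            "rows": str(rows),   # documents fetched per request
        },
    )
    return ET.tostring(root, encoding="unicode")


def full_import_url(target_core_url, handler="/dataimport", clean=True):
    """Build the URL that asks the real Solr install to rerun the import.

    handler is an assumption; use whatever path your solrconfig.xml registers.
    """
    params = urlencode(
        {
            "command": "full-import",
            "clean": str(clean).lower(),  # clean=true wipes the target first
            "commit": "true",
        }
    )
    return f"{target_core_url.rstrip('/')}{handler}?{params}"


if __name__ == "__main__":
    print(build_data_config("http://storage-host:8983/solr/storage"))
    print(full_import_url("http://live-host:8983/solr/live"))
```

Requesting the second URL (for example with curl) would start the reindex on the real installation. Only stored fields come back from the storage instance, which is why every field there needs stored="true".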