Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/02/14 09:01:28 UTC
[Solr Wiki] Update of "CollectionBuilding" by HossMan

The following page has been changed by HossMan:
http://wiki.apache.org/solr/CollectionBuilding

The comment on the change is:
initial import of PI/CollectionBuilding from CNET's wiki

New page:
= Collection Building =

Collection building means creating a new index from scratch; it is generally a "manual" effort. A full rebuild, in which a new collection replaces the old one, is required in cases such as the following:
   * When building a new collection where no previous collection data exists.
   * When launching something new.
   * When a collection has become corrupted to a greater or lesser extent. 
   * When redefining an existing field type, that is, changing your schema in a way that requires a rebuild. For example, merely adding fields to the schema does not require a rebuild, but changing a field's type from a simple integer to some exotic type of integer does.

[[TableOfContents]]

== Recommended Procedure for New Index Building ==

Perform the procedure below from the master server to do collection rebuilds in a production environment. 
   1. Turn off distribution by running '''rsyncd-stop'''. This prevents the slaves from getting data from the master. [[BR]] '''Note:''' Ensure that a distribution is not running when you run rsyncd-stop.
   1. Run the script, '''abc''' (Atomic Backup post-Commit), to create a snapshot for a safe backup.
   1. Delete/clean-out the active directory, '''./index''', on the master server.
   1. Disable any incremental updates that might arrive while you are performing this procedure, for example by pausing the event daemon or crontab entry that triggers them.
   1. If you have changes to the schema or any new configurations to be installed, stop the server. Make the changes to the schema/configurations and install them.
   1. Restart the server.
   1. Run the index builder. Build time varies with the amount and type of data that you have. You may want to monitor the build if it is a long or complex one.
   1. Run the script, '''abo''' (Atomic Backup post-Optimize), to optimize the collection. [[BR]] '''Note:''' If you know that a large number of incremental updates from Step 4 are still in process, wait until they are done before running abo.
   1. Run the '''rsyncd-start''' script to re-enable collection distribution requests from the slaves. The new collection data will be pulled by the slaves while still serving requests. 
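
The ordering of the steps above can be sketched as a small driver. This is only an illustration, not a tested deployment script: the names '''rsyncd-stop''', '''abc''', '''abo''' and '''rsyncd-start''' come from the procedure, while every other step name is a hypothetical placeholder for a site-specific action, and the `run` callback is assumed to be supplied by you.

```python
from typing import Callable, List, Sequence

# Ordered rebuild steps. Only rsyncd-stop, abc, abo and rsyncd-start are
# script names taken from the procedure; the rest are hypothetical
# placeholders for site-specific actions.
REBUILD_STEPS: Sequence[str] = (
    "rsyncd-stop",           # 1. stop distribution to the slaves
    "abc",                   # 2. snapshot the current index as a backup
    "clean-index",           # 3. placeholder: empty ./index on the master
    "disable-incrementals",  # 4. placeholder: pause incremental updates
    "install-config",        # 5. placeholder: stop server, install schema/config
    "restart-server",        # 6. placeholder: restart the server
    "run-index-builder",     # 7. placeholder: your own index builder
    "abo",                   # 8. optimize the new collection
    "rsyncd-start",          # 9. let the slaves pull the new collection
)


def run_rebuild(run: Callable[[str], bool],
                steps: Sequence[str] = REBUILD_STEPS) -> List[str]:
    """Run the steps in order, stopping at the first one that fails.

    `run` receives a step name and returns True on success.
    The return value lists the steps that completed.
    """
    done: List[str] = []
    for step in steps:
        if not run(step):
            break
        done.append(step)
    return done


if __name__ == "__main__":
    # Dry run: print each step instead of executing anything.
    run_rebuild(lambda step: print("would run:", step) is None)
```

Stopping at the first failure matters here because distribution is re-enabled only by the final step, so a failed build never reaches the slaves.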

=== Alternative Approaches for New Index Building ===

   * Create an "offline" Solr port, index from scratch on the offline port, disable snapshot pulling, shut down the master, copy the index from the offline port to the master, then enable snapshot pulling.
   * Create an "offline" Solr port, index from scratch on the offline port, disable snapshot pulling, shut down the master, copy the index from the offline port to the master, disable the slave boxes one at a time and manually copy the index to each, then enable snapshot pulling. (This last approach in particular requires a lot more setup time and thought.)

== The Update Schema ==

(Not to be confused with [:SchemaXml:schema.xml].) 

Solr accepts POSTed XML messages for Add/Update, Commit, Delete, and Delete by query at the URL '''/update'''.  Here is the syntax that Solr expects to see:

=== add/update ===

Example:

   {{{
<add>
  <doc>
    <field name="id">05991</field>
    <field name="office">Bridgewater</field>
  </doc>
</add>
}}}

==== Optional attributes for "add" ====
   * `allowDups = "true" | "false"`: default is "false"
   * `overwritePending = "true" | "false"`: default is the opposite of allowDups
   * `overwriteCommitted = "true" | "false"`: default is the opposite of allowDups
 
The defaults for overwritePending and overwriteCommitted are derived from allowDups, so that setting allowDups alone gives consistent behavior:
   * If allowDups is '''false''' (overwrite any duplicates), it implies that overwritePending and overwriteCommitted are '''true''' by default.
   * If allowDups is '''true''' (allow addition of duplicates), it implies that overwritePending and overwriteCommitted are '''false''' by default.
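
The default rule above can be written out as a small function. This is a sketch of the rule as described on this page, not Solr's own code; the function and parameter names are made up for the illustration.

```python
from typing import Optional, Tuple


def resolve_overwrite_flags(allow_dups: bool = False,
                            overwrite_pending: Optional[bool] = None,
                            overwrite_committed: Optional[bool] = None
                            ) -> Tuple[bool, bool]:
    """Return the effective (overwritePending, overwriteCommitted) pair.

    A flag left as None falls back to the opposite of allowDups,
    which is the default rule described above.
    """
    default = not allow_dups
    pending = default if overwrite_pending is None else overwrite_pending
    committed = default if overwrite_committed is None else overwrite_committed
    return pending, committed
```

For example, `resolve_overwrite_flags()` yields `(True, True)`, while `resolve_overwrite_flags(allow_dups=True)` yields `(False, False)`; an explicitly set flag always wins over the derived default.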
 
==== Optional attributes on "doc" ====
   * `boost = floating_point_value`: default is 1.0 (See Lucene docs for definition of boost.)
 
==== Optional attributes for "field" ====
   * `boost = floating_point_value`: default is 1.0 (See Lucene docs for definition of boost.)
 

Example of "add" with optional attributes:

   {{{
<add allowDups="false" overwriteCommitted="true" overwritePending="true">
  <doc boost="2.5">
    <field name="id">05991</field>
    <field name="office" boost="2.0">Bridgewater</field>
  </doc>
</add>
}}}
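
If you generate "add" messages from a program rather than by hand, an XML library takes care of escaping characters such as `&` and `<` in field values. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function `build_add` and its parameters are hypothetical names chosen for this example, and only the element and attribute names come from the syntax above.

```python
import xml.etree.ElementTree as ET


def build_add(fields, doc_boost=None, field_boosts=None, add_attrs=None):
    """Build an <add> message containing a single <doc>.

    fields       -- list of (name, value) pairs, kept in order
    doc_boost    -- optional boost for the <doc> element
    field_boosts -- optional {field name: boost} mapping
    add_attrs    -- optional attributes for <add>, e.g. {"allowDups": "false"}
    """
    add = ET.Element("add", dict(add_attrs or {}))
    doc = ET.SubElement(add, "doc")
    if doc_boost is not None:
        doc.set("boost", str(doc_boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        if field_boosts and name in field_boosts:
            field.set("boost", str(field_boosts[name]))
        field.text = str(value)  # special characters are escaped on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(build_add([("id", "05991"), ("office", "Bridgewater")],
                    doc_boost=2.5,
                    field_boosts={"office": 2.0},
                    add_attrs={"allowDups": "false"}))
```

Passing the id as a string keeps its leading zero, which would be lost if it were given as a number.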

=== "commit" and "optimize" ===

Example:
   {{{
<commit/>
<optimize/>
}}}
 
==== Optional attributes for "commit" and "optimize" ====

   * `waitFlush = "true" | "false"`: default is true; block until index changes are flushed to disk
   * `waitSearcher = "true" | "false"`: default is true; block until a new searcher is opened and registered as the main query searcher, making the changes visible.

Example of "commit" and "optimize" with optional attributes:
   {{{
<commit waitFlush="false" waitSearcher="false"/>
<optimize waitFlush="false" waitSearcher="false"/>
}}}
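
The same messages can be produced programmatically. The sketch below, again assuming Python's standard `xml.etree.ElementTree`, emits one message per call, on the assumption that each command is sent as its own POST; `build_commit` is a hypothetical helper name, and attributes left unset are omitted so that the server-side defaults apply.

```python
import xml.etree.ElementTree as ET


def build_commit(command="commit", wait_flush=None, wait_searcher=None):
    """Build a <commit/> or <optimize/> message.

    wait_flush / wait_searcher are written only when given, so leaving
    them as None keeps the documented default (true) in effect.
    """
    if command not in ("commit", "optimize"):
        raise ValueError("command must be 'commit' or 'optimize'")
    elem = ET.Element(command)
    if wait_flush is not None:
        elem.set("waitFlush", "true" if wait_flush else "false")
    if wait_searcher is not None:
        elem.set("waitSearcher", "true" if wait_searcher else "false")
    return ET.tostring(elem, encoding="unicode")


if __name__ == "__main__":
    print(build_commit(wait_flush=False, wait_searcher=False))
    print(build_commit("optimize"))
```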

=== "delete" by ID and by Query ===
Example:
   {{{
<delete><id>05991</id></delete>
<delete><query>office:Bridgewater</query></delete>
}}}

==== Optional attributes for "delete" ====

   * `fromPending = "true" | "false"`: default is "true"
   * `fromCommitted = "true" | "false"`: default is "true"
 
Example of "delete" with optional attributes:

   {{{
<delete fromPending="true" fromCommitted="true"><id>05991</id></delete>
<delete fromPending="true" fromCommitted="true"><query>office:Bridgewater</query></delete>
}}}
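
A "delete" message carries either an id or a query. The following sketch enforces that choice when building the message; it assumes Python's standard `xml.etree.ElementTree`, and `build_delete` and its parameter names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_delete(doc_id=None, query=None, from_pending=None, from_committed=None):
    """Build a <delete> message for exactly one of doc_id or query."""
    if (doc_id is None) == (query is None):
        raise ValueError("give exactly one of doc_id or query")
    delete = ET.Element("delete")
    # Optional attributes are written only when explicitly requested.
    if from_pending is not None:
        delete.set("fromPending", "true" if from_pending else "false")
    if from_committed is not None:
        delete.set("fromCommitted", "true" if from_committed else "false")
    if doc_id is not None:
        ET.SubElement(delete, "id").text = str(doc_id)
    else:
        ET.SubElement(delete, "query").text = str(query)
    return ET.tostring(delete, encoding="unicode")


if __name__ == "__main__":
    print(build_delete(doc_id="05991"))
    print(build_delete(query="office:Bridgewater", from_pending=True))
```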

=== Updating a Data Record via curl ===
You can use curl to send any of the above commands. For example:

{{{
curl http://<hostname>:<port>/update --data-binary '<add allowDups="false" overwriteCommitted="true" overwritePending="true">
<doc boost="2.5"> <field name="id">05991</field>
<field name="office" boost="2.0">Bridgewater</field> </doc> </add>'
}}}

{{{
curl http://<hostname>:<port>/update --data-binary '<commit waitFlush="false" waitSearcher="false"/>'
}}}

Until a commit has been issued, the new data does not appear in searches on either the master or the slaves. After a commit, the results are visible on the master; once a slave has pulled a snapshot, they are visible there as well.
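
For callers that prefer not to shell out to curl, roughly the same POST can be made with Python's standard `urllib.request`. This is a sketch under assumptions: the `text/xml` content type is a guess (the curl examples above set no header), and `make_update_request` and `post_update` are hypothetical helper names.

```python
import urllib.request


def make_update_request(base_url, xml_message):
    """Prepare a POST of one XML message to <base_url>/update."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/update",
        data=xml_message.encode("utf-8"),
        # Assumption: the server accepts text/xml for update messages.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_update(base_url, xml_message, timeout=30):
    """Send the message and return the response body as text."""
    request = make_update_request(base_url, xml_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

To make a change searchable on the master, call `post_update` once with the "add" message and once more with a "commit" message, using your own `http://<hostname>:<port>` as `base_url`.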