Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/02/26 22:16:47 UTC

[Solr Wiki] Trivial Update of "CollectionDistribution" by HossMan

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by HossMan:
http://wiki.apache.org/solr/CollectionDistribution

The comment on the change is:
Solr is not an acronym

------------------------------------------------------------------------------
- = Collection Distribution =
- 
  /!\ :TODO: /!\ update script / config paths based on final packaging strategy.
  
- SOLR distribution is similar in concept to database replication.  All collection changes come to one master SOLR server. All production queries are done against query slaves. Query slaves receive all their collection changes indirectly &#151; as new versions of a collection which they pull from the master.
+ Solr distribution is similar in concept to database replication.  All collection changes come to one master Solr server. All production queries are done against query slaves. Query slaves receive all their collection changes indirectly, as new versions of a collection which they pull from the master.
  Slaves poll for these collection downloads on a schedule via cron.
  
  A collection is a directory of many files.  Collections are distributed to the slaves as snapshots of these files.  Each snapshot is made up of hard links to the files, so copying of the actual files is not necessary.  Lucene only ''significantly'' rewrites files following an optimization command.  Generally, once a file has been written it will change very little, if at all.  This makes the underlying transport of rsync very useful.  Files that have already been transferred and have not changed do not need to be re-transferred with the new edition of a collection.
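The hard-link trick above can be sketched in a few lines of shell (paths and file names here are illustrative, not the actual script's):

```shell
# Make a hard-link "snapshot" of an index directory.  Hard links share
# the underlying file data, so the snapshot is nearly free in both
# time and disk space.
index=/tmp/colldist-demo/index
mkdir -p "$index"
echo "segment data" > "$index/_0.cfs"

snap="/tmp/colldist-demo/snapshot.$(date +%Y%m%d%H%M%S)"
mkdir -p "$snap"
cp -l "$index"/* "$snap"/       # -l links instead of copying contents

# The original and the snapshot entry now share one inode,
# so the file's link count is at least 2 (GNU stat shown).
stat -c %h "$index/_0.cfs"
```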
@@ -14, +12 @@

  == Terminology ==
  
  ||'''Term'''||'''Definition'''||
- ||Collection||A Lucene collection is a directory of files.  These comprise the indexed and returnable data of a SOLR search repository.||
+ ||Collection||A Lucene collection is a directory of files.  These comprise the indexed and returnable data of a Solr search repository.||
  ||Distribution||The copying of a collection from the master to all slaves. Distribution of a collection from the master to all slaves takes advantage of Lucene's index file structure. (This same feature enables fast incremental indexing in Lucene.)||
  ||Inserts and Deletes||As inserts and deletes occur in the collection the directory remains unchanged. Documents are always inserted into newly created files.  Documents that are deleted are not removed from the files. They are flagged in the file, '''deletable''', and are not removed from the files until the collection is '''optimized'''.||
- ||Master & Slave||The Solr distribution system uses the master/slave model.  The master is the service which receives all updates initially and keeps everything organized. SOLR uses a single update master server coupled with multiple query slave servers. All changes (such as inserts, updates, deletes, etc.) are made against the single master server. Changes made on the master are distributed to all the slave servers which service all query requests from the clients.||
+ ||Master & Slave||The Solr distribution system uses the master/slave model.  The master is the service which receives all updates initially and keeps everything organized. Solr uses a single update master server coupled with multiple query slave servers. All changes (such as inserts, updates, deletes, etc.) are made against the single master server. Changes made on the master are distributed to all the slave servers which service all query requests from the clients.||
- ||Update||An update is a single change request against a single SOLR instance.  It may be a request to delete a document, add a new document, change a document, delete all documents matching a query, etc.  Updates are handled synchronously within an individual SOLR instance.||
+ ||Update||An update is a single change request against a single Solr instance.  It may be a request to delete a document, add a new document, change a document, delete all documents matching a query, etc.  Updates are handled synchronously within an individual Solr instance.||
  ||Optimization||A Lucene collection must be optimized periodically to maintain query performance. Optimization is run on the master server only. An optimized index will give you a performance gain at query time of ''at least'' 10%.  This gain may be more on an index that has become fragmented over a period of time with many updates and no optimizations. Optimizations require a '''much''' longer time than does the distribution of an optimized collection to all slaves.  During optimization, a collection is compacted and all segments are merged. New secondary segment(s) are created to contain documents inserted into the collection after it has been optimized.||
  ||Segments||The number of files in a collection.||
  ||mergeFactor||Controls the number of files (segments). For example, when mergeFactor is set to 2, existing documents remain in the main segment, while all updates are written to a single secondary segment.||
@@ -26, +24 @@

  
  == The Snapshot and Distribution Process ==
  
- These are the sequential steps that SOLR runs:
+ These are the sequential steps that Solr runs:
  
-    1. The snapshooter command takes snapshots of the collection on the master.  It runs when invoked by SOLR after it has done a commit or an optimize.
+    1. The snapshooter command takes snapshots of the collection on the master.  It runs when invoked by Solr after it has done a commit or an optimize.
     1. The snappuller command runs on the query slaves to pull the newest snapshot from the master. This is done via rsync in daemon mode running on the master so that it does not need to go through ssh compression, thereby saving a large amount of time and CPU cycles.
-    1. The snapinstaller runs on the slave after a snapshot has been pulled from the master. This signals the local SOLR server to open a new index reader, then  auto-warming of the cache(s) begins (in the new reader), while other requests continue to be served by the original index reader.  Once auto-warming is complete, SOLR retires the old reader and directs all new queries to the newly cache-warmed reader.
+    1. The snapinstaller runs on the slave after a snapshot has been pulled from the master. This signals the local Solr server to open a new index reader, then  auto-warming of the cache(s) begins (in the new reader), while other requests continue to be served by the original index reader.  Once auto-warming is complete, Solr retires the old reader and directs all new queries to the newly cache-warmed reader.
     1. All distribution activity is logged and written back to the master to be viewable on the distribution page of its GUI.
     1.  Old versions of the index are removed from the master and slave servers by a cron'd snapcleaner.
  
@@ -48, +46 @@

     * Taking a snapshot is very fast as well.  
  
  
- == SOLR Distribution Scripts ==
+ == Solr Distribution Scripts ==
  
     * The name of the index directory is defined in the configuration file, web.external.xml. [[BR]] /!\ :TODO: /!\ revise
-    * All Solr collection distribution scripts are installed by the RPM '''solr-tools''' and reside in the directory '''scripts/solr''' of each instance of SOLR. [[BR]] /!\ :TODO: /!\ revise
+    * All Solr collection distribution scripts are installed by the RPM '''solr-tools''' and reside in the directory '''scripts/solr''' of each instance of Solr. [[BR]] /!\ :TODO: /!\ revise
     * Collection distribution scripts create and prepare for distribution a snapshot of a search collection after each '''commit''' and '''optimize''' request.
   * The '''snapshooter''' script creates a directory ''snapshot.<ts>'', where <ts> is a timestamp in the format yyyymmddHHMMSS.  It contains hard links to the data files.
     * Snapshots are distributed from the master server when the slaves pull them, "smartcopying" the snapshot directory that contains the hard links to the most recent collection data files.  
-    * For usage arguments and syntax see ["SOLRCollectionDistributionScripts"]
+    * For usage arguments and syntax see ["SolrCollectionDistributionScripts"]
  
  ||'''Name'''||'''Description'''||
  ||snapshooter||Creates a snapshot of a collection. Snapshooter takes no arguments as it always applies to the most recent snapshot. Snapshooter runs only on the Master server when a commit happens. Snapshooter can also be run manually.||
  ||snappuller|| A shell script that runs as a cron job on a slave server. The script looks for new snapshots on the master server and pulls them. ||
  ||snappuller-enable||Creates the file, '''snappuller-enabled''', whose presence enables the snappuller.||
- ||snapinstaller||Snapinstaller installs the latest snapshot (determined by the timestamp) into the file, logs/snapshot.current, using hard links (similar to the process of taking a snapshot). Then snapshot.current is written and scp (secure copied) back to the master server. Snapinstaller then triggers SOLR to open a new Searcher.||
+ ||snapinstaller||Snapinstaller installs the latest snapshot (determined by the timestamp) into the file, logs/snapshot.current, using hard links (similar to the process of taking a snapshot). Then snapshot.current is written and scp (secure copied) back to the master server. Snapinstaller then triggers Solr to open a new Searcher.||
  ||snapcleaner||Runs as a cron job to remove snapshot directories more than seven days old. Also can be run manually.||
- ||rsyncd-start||Starts the rsyncd daemon on the master SOLR server which handles collection distribution requests from the slaves.||
+ ||rsyncd-start||Starts the rsyncd daemon on the master Solr server which handles collection distribution requests from the slaves.||
  ||rsyncd daemon|| Efficiently synchronizes a collection (between master and slaves) by copying only the files that actually changed. In addition, rsync can compress data before transmitting it. ||
- ||rsyncd-stop||Stops the rsyncd daemon on the master SOLR server. The stop script then makes sure that the daemon has in fact exited by trying to connect to it for up to 300 seconds. The stop script exits with ''error code 2'' if it fails to stop the rsyncd daemon.||
+ ||rsyncd-stop||Stops the rsyncd daemon on the master Solr server. The stop script then makes sure that the daemon has in fact exited by trying to connect to it for up to 300 seconds. The stop script exits with ''error code 2'' if it fails to stop the rsyncd daemon.||
  ||rsyncd-enable||Creates the file, rsyncd-enabled, whose presence allows the rsyncd daemon to run, allowing replication to occur.||
  ||rsyncd-disable||Removes the file, rsyncd-enabled, whose absence prevents the rsyncd daemon from running, preventing replication. ||
  
@@ -76, +74 @@

         -v          increase verbosity
  }}}
  
- ''snapshooter'' uses the configuration file, '''conf/distribution.conf''', to determine if the SOLR server is a master or slave. snapshooter is disabled on all slave SOLR servers. When invoked on a slave server, snapshooter displays an error message and exits with error code 1 without taking a snapshot.
+ ''snapshooter'' uses the configuration file, '''conf/distribution.conf''', to determine whether the Solr server is a master or a slave. snapshooter is disabled on all slave Solr servers. When invoked on a slave server, snapshooter displays an error message and exits with error code 1 without taking a snapshot.
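A sketch of that guard (the key/value format of conf/distribution.conf shown here is an assumption for illustration, not the file's documented syntax):

```shell
#!/bin/sh
# Illustrative version of snapshooter's master/slave check: a slave
# logs an error and returns 1 without taking a snapshot.
conf=/tmp/colldist-demo/distribution.conf
mkdir -p /tmp/colldist-demo
echo "role=slave" > "$conf"          # pretend this host is a slave

snapshooter_demo() {
    role=$(sed -n 's/^role=//p' "$conf")
    if [ "$role" = "slave" ]; then
        echo "snapshooter is disabled on slave servers" >&2
        return 1
    fi
    echo "taking snapshot..."
}

snapshooter_demo
echo "exit code: $?" > /tmp/colldist-demo/snapshooter.status
```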
  
  === rsyncd-enable ===
  
@@ -85, +83 @@

         -v          increase verbosity
  }}}
  
- ''rsyncd_enable'' enables the starting of the rsyncd daemon by creating the file rsyncd-enabled in the top level directory of the SOLR installation (for example, /var/opt/resin3/7000).  Please note that this script will not actually starts the rsyncd daemon.
+ ''rsyncd-enable'' enables the starting of the rsyncd daemon by creating the file rsyncd-enabled in the top level directory of the Solr installation (for example, /var/opt/resin3/7000).  Please note that this script will not actually start the rsyncd daemon.
  
  === rsyncd-disable ===
  
@@ -94, +92 @@

         -v          increase verbosity
  }}}
  
- ''rsyncd-disable'' disables the starting of the rsyncd daemon by removing the file, rsyncd-enabled, from the top level directory of the SOLR installation (for example, /var/opt/resin3/7000).  Please note that this script will not actually stop the rsyncd daemon if it is already running.
+ ''rsyncd-disable'' disables the starting of the rsyncd daemon by removing the file, rsyncd-enabled, from the top level directory of the Solr installation (for example, /var/opt/resin3/7000).  Please note that this script will not actually stop the rsyncd daemon if it is already running.
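Both scripts reduce to managing a marker file; a minimal sketch (the install root is a placeholder, not the real path):

```shell
# rsyncd-enable / rsyncd-disable boil down to creating and removing
# a marker file whose presence permits the daemon to run.
solr_root=/tmp/colldist-demo/root
mkdir -p "$solr_root"

touch "$solr_root/rsyncd-enabled"              # rsyncd-enable
[ -f "$solr_root/rsyncd-enabled" ] && echo "rsyncd may run"

rm -f "$solr_root/rsyncd-enabled"              # rsyncd-disable
[ ! -f "$solr_root/rsyncd-enabled" ] && echo "rsyncd may not run"
```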
  
  === rsyncd-start ===
  
@@ -102, +100 @@

  usage: rsyncd-start
  }}}
  
- Starts the rsyncd daemon on the master SOLR server. The rsyncd daemon sets its port number to be the port number of the master SOLR server incremented by 10000. For example, if the master SOLR server runs at port 7000, then its  rsyncd daemon runs at port 17000. The start script is synchronous. After starting the rsyncd daemon, it will attempt to connect to it for up to 15 seconds. The start script will exit with error code 2 if it fails to connect to the rsyncd daemon. The configuration of the rsyncd daemon is controlled by the file, conf/rsyncd.conf. The process ID of the rsyncd daemon is written into the file, logs/rsyncd.pid. Output of the rsyncd daemon is written to the file, logs/rsyncd.log.
+ Starts the rsyncd daemon on the master Solr server. The rsyncd daemon sets its port number to be the port number of the master Solr server incremented by 10000. For example, if the master Solr server runs at port 7000, then its  rsyncd daemon runs at port 17000. The start script is synchronous. After starting the rsyncd daemon, it will attempt to connect to it for up to 15 seconds. The start script will exit with error code 2 if it fails to connect to the rsyncd daemon. The configuration of the rsyncd daemon is controlled by the file, conf/rsyncd.conf. The process ID of the rsyncd daemon is written into the file, logs/rsyncd.pid. Output of the rsyncd daemon is written to the file, logs/rsyncd.log.
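The port arithmetic and the synchronous wait can be sketched as follows (the connectivity check is stubbed out; no real daemon is started):

```shell
# rsyncd listens on the master's port plus 10000; the start script
# then polls for up to 15 seconds and exits 2 on failure.
mkdir -p /tmp/colldist-demo
master_port=7000
rsyncd_port=$((master_port + 10000))   # 17000 for a master on 7000

daemon_up() { true; }   # stub standing in for a real connect attempt
tries=0
until daemon_up; do
    tries=$((tries + 1))
    if [ "$tries" -ge 15 ]; then
        echo "rsyncd failed to start" >&2
        exit 2
    fi
    sleep 1
done
echo "rsyncd up on port $rsyncd_port" > /tmp/colldist-demo/rsyncd-start.log
```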
  
  === rsyncd-stop ===
  
@@ -110, +108 @@

  usage: rsyncd-stop
  }}}
  
- Stops the rsyncd daemon on the master SOLR server. The stop script is synchronous. After stopping the rsyncd daemon, it makes sure that the daemon has exited by trying to connect to it for up to 300 seconds. The stop script will exit with error code 2 if it fails to stop the rsyncd daemon.
+ Stops the rsyncd daemon on the master Solr server. The stop script is synchronous. After stopping the rsyncd daemon, it makes sure that the daemon has exited by trying to connect to it for up to 300 seconds. The stop script will exit with error code 2 if it fails to stop the rsyncd daemon.
  
  === snappuller-enable ===
  
@@ -119, +117 @@

         -v          increase verbosity
  }}}
  
- ''snappuller_enable'' enables the snappuller by creating the file, snappuller-enabled, in the top level directory of the SOLR installation (for example, /var/opt/resin3/7000).
+ ''snappuller-enable'' enables the snappuller by creating the file, snappuller-enabled, in the top level directory of the Solr installation (for example, /var/opt/resin3/7000).
  
  === snappuller-disable ===
  
@@ -128, +126 @@

         -v          increase verbosity
  }}}
  
- ''snappuller-disable'' disables the snappuller by removing the file, snappuller-enabled, from the top level directory of the SOLR installation (for example, /var/opt/resin3/7000).
+ ''snappuller-disable'' disables the snappuller by removing the file, snappuller-enabled, from the top level directory of the Solr installation (for example, /var/opt/resin3/7000).
  
  === snappuller ===
  
@@ -142, +140 @@

         -z          enable compression of data
  }}}
  
- ''snappuller'' gets the hostname and port number of the master SOLR server from the file conf/distribution.conf. The values in conf/distribution.conf are overwritten by the command line options, -m and -p ,if they are present.
+ ''snappuller'' gets the hostname and port number of the master Solr server from the file conf/distribution.conf. The values in conf/distribution.conf are overridden by the command line options, -m and -p, if they are present.
  
- If snappuller has been disabled, it will log an appropriate message in its log file, and then exit without pulling any snapshot from the master SOLR server.
+ If snappuller has been disabled, it will log an appropriate message in its log file, and then exit without pulling any snapshot from the master Solr server.
  
- If the name of the snapshot to be pull is not specified by the use of the "-n" option, snappuller will use ssh to determine the name of the most recent snapshot available on the master SOLR server and pull it over if it does not already exist on the slave SOLR server.
+ If the name of the snapshot to be pulled is not specified with the "-n" option, snappuller will use ssh to determine the name of the most recent snapshot available on the master Solr server and pull it over if it does not already exist on the slave Solr server.
  
- The status and stats of the current or most recent rsync operation of snappuller is kept in the file, logs/snappuller.status. Whenever this file is updated by snappuller, a copy is scp back to the master SOLR server. See Distribution Status and Stats for more details.
+ The status and stats of the current or most recent rsync operation of snappuller are kept in the file, logs/snappuller.status. Whenever this file is updated by snappuller, a copy is sent back to the master Solr server via scp. See Distribution Status and Stats for more details.
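The "most recent snapshot" selection works because yyyymmddHHMMSS names sort chronologically. A local sketch of the pull logic (cp stands in for the rsync-over-ssh transfer, and all paths are illustrative):

```shell
# Simulate a master with two snapshots and a slave pulling the newest.
master=/tmp/snappuller-demo/master
slave=/tmp/snappuller-demo/slave
mkdir -p "$master/snapshot.20060226010101" \
         "$master/snapshot.20060226020202" "$slave"

# Timestamp names sort lexically in chronological order, so the last
# entry is the most recent snapshot.
newest=$(ls -d "$master"/snapshot.* | sort | tail -n 1)
name=$(basename "$newest")

# Pull only if the slave does not already have this snapshot.
[ -d "$slave/$name" ] || cp -r "$newest" "$slave/$name"
echo "pulled $name"
```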
  
  === snapinstaller ===
  
@@ -159, +157 @@

         -v          increase verbosity
  }}}
  
- ''snapinstaller'' gets the hostname and port number of the master SOLR server from the file conf/distribution.conf. The values in conf/distribution.conf are overwritten by the command line options, -m and -p, if they are present.
+ ''snapinstaller'' gets the hostname and port number of the master Solr server from the file conf/distribution.conf. The values in conf/distribution.conf are overridden by the command line options, -m and -p, if they are present.
  
- After a snapshot has been installed, snapinstaller writes its name into the file, logs/snapshot.current, and scp a copy of this file back to the master SOLR server. See Distribution Status and Stats for more details.
+ After a snapshot has been installed, snapinstaller writes its name into the file, logs/snapshot.current, and copies this file back to the master Solr server via scp. See Distribution Status and Stats for more details.
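A sketch of the install-and-record step (paths illustrative; the scp of snapshot.current back to the master is omitted):

```shell
# Install the snapshot's files into the live index via hard links,
# then record the installed snapshot name for the status pages.
base=/tmp/snapinstaller-demo
mkdir -p "$base/snapshot.20060226020202" "$base/index" "$base/logs"
echo "segment data" > "$base/snapshot.20060226020202/_0.cfs"

cp -l "$base/snapshot.20060226020202"/* "$base/index"/

echo "snapshot.20060226020202" > "$base/logs/snapshot.current"
cat "$base/logs/snapshot.current"
```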
  
  === snapcleaner ===
  
@@ -173, +171 @@

         -v           increase verbosity
  }}}
  
- == SOLR Distribution related Cron Jobs ==
+ == Solr Distribution related Cron Jobs ==
  
  The distribution process is automated via the use of cron jobs.
- The cron jobs should run under the user id that the SOLR application is
+ The cron jobs should run under the user id that the Solr application is
  running under.
  
  === snapcleaner ===
  
  ''snapcleaner'' should be run out of cron on a regular basis to clean up
- old snapshots.  This should be done on both the SOLR master and slave servers.  
+ old snapshots.  This should be done on both the Solr master and slave servers.  
  For example, the following cron job runs every day at midnight and cleans up snapshots 8 days and older:
  
  {{{
@@ -193, +191 @@

  
  === snappuller and snapinstaller ===
  
- On the SOLR slave servers, ''snappuller'' should be run out of cron regularily to get the latest index from the master server.  It is a good idea to also run ''snapinstaller'' with ''snappuller'' back-to-back in the same crontab entry to install the latest index once it has been copied over to the slave.  For example, the following cron job runs every 5 minutes to keep the slave server in sync with the master:
+ On the Solr slave servers, ''snappuller'' should be run out of cron regularly to get the latest index from the master server.  It is a good idea to also run ''snapinstaller'' with ''snappuller'' back-to-back in the same crontab entry to install the latest index once it has been copied over to the slave.  For example, the following cron job runs every 5 minutes to keep the slave server in sync with the master:
  
  {{{
  0,5,10,15,20,25,30,35,40,45,50,55 * * * * /var/opt/resin3/5051/scripts/solr/snappuller;/var/opt/resin3/5051/scripts/solr/snapinstaller