Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2006/04/08 02:38:30 UTC

[Solr Wiki] Update of "SolrPerformanceFactors" by HossMan

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by HossMan:
http://wiki.apache.org/solr/SolrPerformanceFactors

The comment on the change is:
loose adaptation of CNET Wiki "SolarPerformanceFactors"

New page:
[[TableOfContents]]

== Schema Design Considerations ==

Increasing the number of indexed fields increases all of the following:

   * Memory usage during indexing
   * Segment merge time
   * Optimization times
   * Index size

These impacts are increased when field norms are used. (ie: `omitNorms="false"`)

== Configuration Considerations ==

=== mergeFactor ===

The mergeFactor roughly determines the number of segments.
 
The mergeFactor value tells Lucene how many documents to store in memory before writing them to disk, as well as how often to merge multiple segments together. With the default value of 10, Lucene will store 10 documents in memory before writing them out to a single segment on disk.

For example, if you set mergeFactor to 10, a new segment will be created on disk for every 10 documents added to the index. When the 10th segment of size 10 is added, all 10 will be merged into a single segment of size 100. When 10 such segments of size 100 have been added, they will be merged into a single segment containing 1,000 documents, and so on. Therefore, at any time, there will be no more than 9 (i.e., mergeFactor - 1) segments at each index size.
 
These values are set in the '''mainIndex''' section of solrconfig.xml (disregard the indexDefaults section).
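As a sketch, the relevant section of solrconfig.xml might look something like this (the values shown are common defaults, not tuning recommendations):

```xml
<!-- solrconfig.xml: settings in mainIndex control Solr's main index -->
<mainIndex>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <maxBufferedDocs>1000</maxBufferedDocs>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>10000</maxFieldLength>
</mainIndex>
```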

==== mergeFactor Tradeoffs ====

High value merge factor (e.g., 25):
   * Pro: Generally improves indexing speed
   * Con: Less frequent merges result in an index with more segment files, which may slow searching

Low value merge factor (e.g., 2):
   * Pro: Smaller number of index files, which speeds up searching; the index is also more up-to-date
   * Con: More frequent segment merges slow down indexing

=== HashDocSet Max Size Considerations ===

The hashDocSet is an optimization specified in the solrconfig.xml that enables an int hash representation for filters (docSets) when the number of items in the set is less than maxSize.  For smaller sets, this representation is more memory efficient, more efficient to iterate, and faster to take intersections.

The hashDocSet max size should be based primarily on the number of documents in the collection -- the larger the number of documents, the larger the hashDocSet max size. You will have to do a bit of trial-and-error to arrive at the optimal number:

1. Calculate 0.005 of the total number of documents that you are going to store.
   1. Try values on either 'side' of that value to arrive at the best query times. 
   1. When query times seem to plateau, and performance doesn't show much difference between the higher number and the lower, use the higher.
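For example (the maxSize value here is only an illustrative starting point derived from step 1, not a recommendation): for a collection of 1,000,000 documents, 0.005 * 1,000,000 = 5,000, so you might begin with:

```xml
<!-- solrconfig.xml: starting point for a ~1M document collection;
     try values on either side of maxSize per the steps above -->
<HashDocSet maxSize="5000" loadFactor="0.75"/>
```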

=== Cache autoWarm Count Considerations ===

When a new searcher is opened, its caches may be prepopulated or "autowarmed" with cached objects from the caches of the old searcher. `autowarmCount` is the number of cached items that will be copied into the new searcher. You will probably want to base the autowarmCount setting on how long it takes to autowarm: consider the trade-off between time-to-autowarm and how warm (i.e., how large an autowarmCount) you want each cache to be. The autowarmCount parameter is set for each cache in solrconfig.xml.
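A cache declaration in solrconfig.xml might look like the following sketch (values are illustrative; a larger autowarmCount means a warmer cache but a longer warming period for each new searcher):

```xml
<!-- solrconfig.xml: copy up to 256 entries from the old searcher's
     filterCache into the new searcher's cache on commit -->
<filterCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="256"/>
```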

See also the [:SolrCaching:Solr Caching page].

== Optimization Considerations ==

It is highly recommended that you use an optimized index whenever practical -- ie: if you build your index once and then never modify it, optimize it before making it available for searching.

If your index is receiving a steady stream of modifications, then consider the following factors...

   * When an index goes too long without being optimized, you will likely see query performance degrade.
   * When an index goes too long without being optimized, optimization times become unpredictable.
   * An un-optimized index is going to be ''at least'' 10% slower on un-cached queries than on cached queries (and performance will continue to degrade until it reaches a low plateau, then will degrade no more).
   * Auto-warming time will grow if the index gets too large. 
   * The first distribution after an optimization will take longer than subsequent ones. See [:CollectionDistribution:Collection Distribution] for more information.

== Updates and Commit Frequency Tradeoffs ==

If slaves receive new collections too frequently, their performance will suffer. To avoid this type of degradation you must understand how a slave receives a collection update, so that you can best adjust the relevant parameters (number/frequency of commits, snappullers, and autowarmCount) and new collections do not get installed on slaves too frequently.
 
   1. A snapshot of the collection is taken every time a client runs a commit, or every time an optimization is run, depending on whether `postCommit` or `postOptimize` hooks are used on the master.
   1. Snappullers on the slaves, run via cron, check the master for new snapshots. If a snappuller finds a new collection version, the slave pulls it down and snapinstalls it.
   1. Every time a snapinstall is run on the slave, some autowarming of the cache occurs before Solr hands queries over to that version of the collection. It is crucial to individual query latency that queries have warmed caches.
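The snappuller schedule in step 2 is just a cron entry on each slave. As a sketch (the paths are hypothetical; see [:CollectionDistribution:Collection Distribution] for the actual script names and arguments on your installation):

```
# crontab on each slave: check the master for a new snapshot every
# 5 minutes, and install it if one was pulled (paths are illustrative)
*/5 * * * * /var/opt/solr/bin/snappuller && /var/opt/solr/bin/snapinstaller
```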

The three relevant parameters: 

   * The '''number/frequency of snapshots''' is completely up to the indexing client. Therefore, the number of versions of the collection is determined by the client's activity.
   * The '''snappullers''' are cron'd. They could run every second, once a day, or anything in between. When they run, they will retrieve only the most recent collection that they do not have. 
   * '''Cache autowarming''' is configured for each cache in solrconfig.xml. All caches can be autowarmed, with the exception of documentCache. 

If you desire frequent new collections in order for your most recent changes to appear "live online", you must have both frequent commits/snapshots and frequent snappulls. Currently, every 5 minutes appears to be the most frequently you can commit and snap without losing cache performance.

Cache autowarming is crucial to performance. On one hand a new cache version must be populated with enough entries so that subsequent queries will be served from the cache after the system switches to the new version of the collection. On the other hand, autowarming (populating) a new collection can take a lot of time, especially since it uses only one thread and one CPU. If your settings fire off snapinstaller more frequently than 5 minutes, then a Solr slave could be in the undesirable condition of handing-off queries to one (old) collection, and, while warming a new collection,  a second  “new” one could be snapped and begin warming! 

If we attempted to solve such a situation, we would have to invalidate the first “new” collection in order to use the second one, then when a “third” new collection would be snapped and warmed, we would have to invalidate the “second” new collection, and so on ad infinitum.  A completely warmed collection would never make it to full term before it was aborted.  This can be prevented with a  properly tuned configuration  so new collections do not get installed too rapidly.  

== Query Response Compression ==

Compressing the Solr XML response before it is sent back to the client is worthwhile in some circumstances. If responses are very large, and NIC I/O limits are being approached, ''and'' Gigabit ethernet is not an option, using compression is a way out.

Compression increases CPU use, and since Solr is typically a CPU-bound service, compression ''diminishes'' query performance. Compression attempts to reduce files to 1/6th their original size, and network packets to 1/3rd their original size. (We're not taking the time right now to figure out if the big gap between files and packets makes sense or not, but suffice it to say it's a nice reduction.) Query performance is impacted by ~15% on the Solr server.

Consult the documentation for the application server you are using (ie: Tomcat, Resin, Jetty, etc...) for more information on how to configure page compression.
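For example, in Tomcat this might look like the following sketch of an HTTP connector in server.xml (the attribute values are illustrative; check your Tomcat version's documentation for the exact attributes supported):

```xml
<!-- server.xml: gzip responses larger than 2 KB for the listed MIME types -->
<Connector port="8080"
           compression="on"
           compressionMinSize="2048"
           compressableMimeType="text/xml,text/plain"/>
```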