Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2009/10/29 05:21:12 UTC

[Solr Wiki] Trivial Update of "SolrPerformanceFactors" by YonikSeeley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "SolrPerformanceFactors" page has been changed by YonikSeeley.
http://wiki.apache.org/solr/SolrPerformanceFactors?action=diff&rev1=21&rev2=22

--------------------------------------------------

     * Con: More segment merges slow down indexing.
  
  === HashDocSet Max Size Considerations ===
+ <!> This is only a consideration for Solr 1.3 and earlier.
  
  The hashDocSet is an optimization specified in solrconfig.xml that enables an int hash representation for filters (docSets) when the number of items in the set is less than maxSize.  For smaller sets, this representation is more memory-efficient, faster to iterate, and faster to intersect.
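For illustration, the setting lives in the <query> section of solrconfig.xml. A minimal sketch, using the values shipped with the Solr 1.3 example configuration (tune maxSize to sit just above the size of your typical filter sets):

{{{
<query>
  <!-- Use an int hash representation for filter docSets
       with fewer than maxSize entries. -->
  <HashDocSet maxSize="3000" loadFactor="0.75"/>
</query>
}}}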
  
@@ -114, +115 @@

  
  Consult the documentation for the application server you are using (e.g. Tomcat, Resin, Jetty) for more information on how to configure page compression.
  
- == Embedded vs HTTP Post ==
+ == Indexing Performance ==
+ You can use [[EmbeddedSolr]] to do bulk indexing against an embedded instance of Solr and avoid any HTTP overhead.  Most of that overhead is due to latency, however, and can be hidden by using multiple threads and sending multiple documents per add request.  SolrJ (the Solr Java client) makes this easy with [[http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html|StreamingUpdateSolrServer]], which opens multiple connections to a Solr instance and streams added documents over those open connections.
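A minimal SolrJ sketch of this pattern (the URL, queue size, thread count, and field names below are illustrative assumptions, not requirements):

{{{
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    // Queue up to 20 documents and stream them over 4 open connections.
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 20, 4);

    for (int i = 0; i < 100000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc" + i);           // field names assume the example schema
      doc.addField("name", "document " + i);
      server.add(doc);  // returns quickly; background threads stream the queued docs
    }

    server.blockUntilFinished();  // wait for the queue to drain
    server.commit();
  }
}
}}}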
  
+ Other bulk update methods, such as the [[UpdateCSV|CSV Loader]], also offer very good performance.
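For example, a CSV file can be streamed to the CSV update handler in a single request (the URL below is the default from the example configuration, and books.csv is a placeholder filename):

{{{
curl 'http://localhost:8983/solr/update/csv?commit=true' \
     --data-binary @books.csv \
     -H 'Content-type: text/plain; charset=utf-8'
}}}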
- Using an [[EmbeddedSolr]] for indexing can be over 50% faster than posting XML messages over HTTP.  
- 
- For example, it took 2:10:23 to index 3 million records and optimize, while it took 3:21:36 on the same machine to index via HTTP POST with 10 records/post, or 2:37:17 with 200 records/post.  If you consider that optimize is only one call, the difference is slightly bigger.  The machine for these sample numbers was a 3GHz Pentium 4 desktop.
- 
- However, the tradeoff is that more records per post requires a greater memory footprint.  As records/post grows, it makes more sense to have one thread fetching records from the database/files and another posting the XML messages to Solr (you could also double-buffer).  
- 
- See the [[http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html|java.util.concurrent javadoc]] for more information on threading.
- 
- Also consider using [[http://svn.apache.org/repos/asf/lucene/solr/trunk/src/solrj/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.java|StreamingUpdateSolrServer.java]] for bulk update requests.  
- 
- 
- == RAM Usage Considerations ==
  
  === OutOfMemoryErrors ===
  
@@ -143, +134 @@

  
  === Memory allocated to the Java VM ===
  
- The easiest way to fight this error, assuming the Java virtual machine isn't already using all your physical memory, is to increase the amount of memory allocated to the Java virtual machine running Solr. To do this for the example/ directory in the Solr distribution, if you're running the standard Sun virtual machine, you can use the -Xms and -Xmx command-line parameters:
+ The easiest way to fight this error is to increase the amount of memory allocated to the Java virtual machine running Solr. To do this for the example/ directory in the Solr distribution, if you're running the standard Sun virtual machine, you can use the -Xms and -Xmx command-line parameters:
  
  {{{
  java -Xms512M -Xmx1024M -jar start.jar