Posted to commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2018/10/08 16:46:54 UTC

[Solr Wiki] Update of "DataImportHandlerFaq" by ShawnHeisey

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DataImportHandlerFaq" page has been changed by ShawnHeisey:
https://wiki.apache.org/solr/DataImportHandlerFaq?action=diff&rev1=21&rev2=22

Comment:
Added section for database connections timing out and closing due to large merges

  {{{
  <dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:8889/mysqldatabase" batchSize="-1" user="root" password="root"/>
  }}}
+ 
+ === DataImportHandler imports a few million documents and then fails with an error indicating it has lost communication with the database ===
+ This happens frequently with MySQL databases, but it could be a problem with any database.
+ 
+ What's happening here is that Solr (at the Lucene layer) has at least three segment merges scheduled, and one of those merges is very large.  That merge takes long enough that the database connection sits idle, and either the JDBC driver or the database server closes the connection.
+ 
+ Solr uses Lucene's ConcurrentMergeScheduler by default.  The scheduler's maxMergeCount setting controls how merging interacts with indexing: the scheduler always handles the largest merges on the schedule first, and as long as the number of merges currently scheduled is below maxMergeCount, ongoing indexing is allowed to run.  If the number of scheduled merges reaches that limit (default 3), ongoing indexing is paused until the number drops back below it.  Increasing maxMergeCount in the mergeScheduler allows more merges to be scheduled while still allowing ongoing indexing.  This config needs to go at the top level of solrconfig.xml:
+ 
+ {{{
+   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
+     <int name="maxThreadCount">1</int>
+     <int name="maxMergeCount">6</int>
+   </mergeScheduler>
+ }}}
+ 
+ If the index is on standard spinning disks, leave maxThreadCount at 1.  If it's on SSD, you can increase it, but probably shouldn't go higher than about 2 or 3.
  
  === I'm using DataImportHandler with MS SQL Server database with sqljdbc driver. DataImportHandler is going out of memory. I tried adjusting the batchSize values but they don't seem to make any difference. How do I fix this? ===
  There's a connection property called responseBuffering in the sqljdbc driver whose default value is "full", which causes the entire result set to be fetched. See http://msdn.microsoft.com/en-us/library/ms378988.aspx for more details. You can set this property to "adaptive" to keep the driver from loading the entire result set into memory. Connection properties like this can be set as an attribute (responseBuffering="adaptive") in the dataSource configuration OR directly in the JDBC url specified in DataImportHandler's dataSource configuration.
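  
  As a minimal sketch of the attribute form (the driver class is the standard sqljdbc class, but the data source name, URL, and credentials are placeholders, not values from this page):
  
  {{{
  <dataSource type="JdbcDataSource" name="ds-mssql"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost:1433;databaseName=mydb"
              responseBuffering="adaptive"
              user="user" password="password"/>
  }}}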
@@ -20, +36 @@

  
   * Add your property in the invariants section of solrconfig's DataImportHandler element. For example, add this section (a fuller sketch follows this list):
   * <lst name="invariants"> <str name="xmlDataDir">${xmlDataDir}</str> </lst>
-  * Use it as ${dataimporter.request.xmlDataDir} in your data-config to access this. 
+  * Use it as ${dataimporter.request.xmlDataDir} in your data-config to access this.
   * [[http://markmail.org/message/svlmsygl3sss4f67|see the mail thread]]
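  
  A fuller sketch of what the handler definition might look like (the handler path "/dataimport" and the config file name "data-config.xml" are typical placeholders, not values from this page):
  
  {{{
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
    <lst name="invariants">
      <str name="xmlDataDir">${xmlDataDir}</str>
    </lst>
  </requestHandler>
  }}}
  
  Because xmlDataDir is declared under invariants, request parameters cannot override it.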
  
  === How would I insert a static value into a field? ===
@@ -46, +62 @@

      <document>
            <entity name="findDelta" query="select id from item where id in
                                  (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
-                                 or id in 
+                                 or id in
-                                 (select item_id as id from item_category where item_id in 
+                                 (select item_id as id from item_category where item_id in
                                      (select id as item_id from category where last_modified > '${dataimporter.last_index_time}')
                                  or last_modified > '${dataimporter.last_index_time}')
                                  or last_modified > '${dataimporter.last_index_time}'" rootEntity="false">
              <entity name="item" query="select * from item where id='${findDelta.id}'>
-             <entity name="feature" query="select description as features from feature where item_id='${item.ID}'">                
+             <entity name="feature" query="select description as features from feature where item_id='${item.ID}'">
              </entity>
              <entity name="item_category" query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'">
-                 <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">                    
+                 <entity name="category" query="select description as cat from category where id = '${item_category.CATEGORY_ID}'">
                  </entity>
              </entity>
          </entity>
@@ -65, +81 @@

  }}}
  
  Notice that for the entity 'findDelta', rootEntity is set to false, so a document is created for each row from 'item'.
- The command has to be 
+ The command has to be
  
  command=full-import&clean=false
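  
  For example, the full request might look like this (the host, port, core name, and handler path are placeholders):
  
  {{{
  http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false
  }}}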