Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2011/07/21 21:16:32 UTC

[Solr Wiki] Update of "DataImportHandler" by FrankWesemann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "DataImportHandler" page has been changed by FrankWesemann:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=289&rev2=290

Comment:
added clarification about logLevels supported by LogTransformer

  <<TableOfContents>>
  
  = Overview =
- 
  == Goals ==
   * Read data residing in relational databases
   * Build Solr documents by aggregating data from multiple columns and tables according to configuration
@@ -23, +22 @@

  
  = Design Overview =
  The Handler has to be registered in the solrconfig.xml as follows.
+ 
  {{{
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
@@ -30, +30 @@

      </lst>
    </requestHandler>
  }}}
- 
- 
  As the name suggests, this is implemented as a SolrRequestHandler. The configuration is provided in two places:
  
   * solrconfig.xml. The data config file location is added here
   * The datasource can also be added here, or it can be put directly into the data-config.xml
   * data-config.xml
-    * How to fetch data (queries,url etc)
+   * How to fetch data (queries,url etc)
-    * What to read ( resultset columns, xml fields etc)
+   * What to read ( resultset columns, xml fields etc)
-    * How to process (modify/add/remove fields)
+   * How to process (modify/add/remove fields)
+ 
  = Usage with RDBMS =
  In order to use this handler, the following steps are required.
+ 
   * Define a data-config.xml and specify the location of this file in solrconfig.xml under the DataImportHandler section
   * Give connection information (if you choose to put the datasource information in solrconfig)
-  * Open the DataImportHandler page to verify if everything is in order [[http://localhost:8983/solr/dataimport]]
+  * Open the DataImportHandler page to verify if everything is in order http://localhost:8983/solr/dataimport
   * Use full-import command to do a full import from the database and add to Solr index
   * Use delta-import command to do a delta import (get new inserts/updates) and add to Solr index
  
@@ -52, +52 @@

  
  == Configuring DataSources ==
  Add the tag 'dataSource' directly under the 'dataConfig' tag.
+ 
  {{{
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/dbname" user="db_username" password="db_password"/>
  }}}
@@ -60, +61 @@

   * The attribute 'name' can be used if there are [[#multipleds|multiple datasources]] used by multiple entities
   * All other attributes in the <dataSource> tag are specific to the particular dataSource implementation being configured.
   * [[#datasource|See here]] for plugging in your own
+ 
  <<Anchor(multipleds)>>
+ 
  === Multiple DataSources ===
  It is possible to have more than one datasource in a configuration. To configure an extra datasource, just add another 'dataSource' tag. There is an implicit attribute "name" for a datasource. If there is more than one, each datasource must be identified by a unique name  `'name="datasource-2"'` .
  
  eg:
+ 
  {{{
  <dataSource type="JdbcDataSource" name="ds-1" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db1-host/dbname" user="db_username" password="db_password"/>
  <dataSource type="JdbcDataSource" name="ds-2" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://db2-host/dbname" user="db_username" password="db_password"/>
  }}}
  in your entities:
+ 
  {{{
  ..
  <entity name="one" dataSource="ds-1" ...>
@@ -81, +86 @@

  ..
  }}}
  <<Anchor(jdbcdatasource)>>
+ 
  == Configuring JdbcDataSource ==
  The attributes accepted by !JdbcDataSource are:
+ 
   * '''`driver`''' (required): The jdbc driver classname
   * '''`url`''' (required) : The jdbc connection url
   * '''`user`''' : User name
@@ -93, +100 @@

   * '''`readOnly`''' : If this is set to 'true' , it sets `setReadOnly(true)`, `setAutoCommit(true)`, `setTransactionIsolation(TRANSACTION_READ_UNCOMMITTED)`,`setHoldability(CLOSE_CURSORS_AT_COMMIT)` on the connection <!> [[Solr1.4]]
   * '''`transactionIsolation`''' : The possible values are [TRANSACTION_READ_UNCOMMITTED, TRANSACTION_READ_COMMITTED, TRANSACTION_REPEATABLE_READ,TRANSACTION_SERIALIZABLE,TRANSACTION_NONE] <!> [[Solr1.4]]
  
- 
  Any extra attributes put into the tag are directly passed on to the jdbc driver.
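
  For example, a hypothetical MySQL configuration passing the driver-specific property `zeroDateTimeBehavior` straight through to the jdbc driver (that property name is a MySQL Connector/J option, not a DIH attribute):

  {{{
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/dbname" user="db_username" password="db_password"
              zeroDateTimeBehavior="convertToNull"/>
  }}}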
  
  == Configuration in data-config.xml ==
@@ -104, +110 @@

  In order to get data from the database, our design philosophy revolves around 'templatized sql' entered by the user for each entity. This gives the user the entire power of SQL if needed. The root entity is the central table whose columns can be used to join this table with other child entities.
  
  === Schema for the data config ===
-   The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary and depends on the `processor` and `transformer`.
+  . The dataconfig does not have a rigid schema. The attributes in the entity/field are arbitrary and depend on the `processor` and `transformer`.
+ 
  The default attributes for an entity are:
+ 
   * '''`name`''' (required) : A unique name used to identify an entity
   * '''`processor`''' : Required only if the datasource is not RDBMS. (The default value is `SqlEntityProcessor`)
   * '''`transformer`'''  : Transformers to be applied on this entity. (See the transformer section)
   * '''`dataSource`''' : The name of a datasource, as defined in a 'dataSource' tag. (Used if there are multiple datasources)
-  * '''`threads`''' :  The no:of of threads to use to run this entity. This must be placed on or above a 'rootEntity'. [[Solr1.5]] 
+  * '''`threads`''' :  The number of threads to use to run this entity. This must be placed on or above a 'rootEntity'. [[Solr1.5]]
-  * '''`pk`''' : The primary key for the entity. It is '''optional''' and only needed when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they both can be the same. 
+  * '''`pk`''' : The primary key for the entity. It is '''optional''' and only needed when using delta-imports. It has no relation to the uniqueKey defined in schema.xml but they both can be the same.
   * '''`rootEntity`''' : By default the entities falling directly under the document are root entities. If this is set to false, the entity directly falling under that entity will be treated as the root entity (and so on). For every row returned by the root entity a document is created in Solr
   * '''`onError`''' : (abort|skip|continue). The default value is 'abort'. 'skip' skips the current document. 'continue' continues as if the error did not happen. <!> [[Solr1.4]]
   * '''`preImportDeleteQuery`''' : before full-import this will be used to clean up the index instead of using '*:*'. This is honored only on an entity that is an immediate sub-child of <document> <!> [[Solr1.4]].
   * '''`postImportDeleteQuery`''' : after full-import this will be used to clean up the index. This is honored only on an entity that is an immediate sub-child of <document> <!> [[Solr1.4]].
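
  A hypothetical entity putting a few of these attributes together (the table and field names are made up for illustration):

  {{{
  <entity name="product" pk="id" dataSource="ds-1" onError="skip"
          preImportDeleteQuery="category:discontinued"
          query="select id, name from product">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
  </entity>
  }}}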
+ 
  For !SqlEntityProcessor the entity attributes are:
  
   * '''`query`''' (required) : The SQL query used to fetch rows from the db
@@ -124, +133 @@

   * '''`deletedPkQuery`''' : Only used in delta-import
   * '''`deltaImportQuery`''' : (Only used in delta-import). If this is not present, DIH tries to construct the import query (after identifying the delta) by modifying the '`query`' (this is error prone). There is a namespace `${dataimporter.delta.<column-name>}` which can be used in this query.  e.g: `select * from tbl where id=${dataimporter.delta.id}`  <!> [[Solr1.4]].
  
- 
  == Commands ==
- <<Anchor(commands)>>
- The handler exposes all its API as http requests . The following are the possible operations
+ <<Anchor(commands)>> The handler exposes all its API as HTTP requests. The following are the possible operations:
  
   * '''full-import''' : Full Import operation can be started by hitting the URL `http://<host>:<port>/solr/dataimport?command=full-import`
    * This operation will be started in a new thread and the ''status'' attribute in the response will show ''busy'' while it is running.
@@ -148, +155 @@

   * '''abort''' : Abort an ongoing operation by hitting the url `http://<host>:<port>/solr/dataimport?command=abort`
  
  == Full Import Example ==
- 
  Let us consider an example. Suppose we have the following schema in our database
  
  {{attachment:example-schema.png}}
@@ -156, +162 @@

  This is a relational model of the same schema that Solr currently ships with. We will use this as an example to build a data-config.xml for DataImportHandler. We've created a sample database with this schema using [[http://hsqldb.org/|HSQLDB]].  To run it, do the following steps:
  
   1. Look at the example/example-DIH directory in the solr download. It contains a complete solr home with all the configuration you need to execute this as well as the RSS example (given later in this page).
-  2. Use the ''example-DIH/solr'' directory as your solr home.  Start Solr by running from the root {{{/examples}}} directory: {{{java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar}}}
+  1. Use the ''example-DIH/solr'' directory as your solr home.  Start Solr by running from the root {{{/examples}}} directory: {{{java -Dsolr.solr.home="./example-DIH/solr/" -jar start.jar}}}
-  3. Hit [[http://localhost:8983/solr/db/dataimport]] with a browser to verify the configuration.
+  1. Hit http://localhost:8983/solr/db/dataimport with a browser to verify the configuration.
-  4. Hit [[http://localhost:8983/solr/db/dataimport?command=full-import]] to do a full import.
+  1. Hit http://localhost:8983/solr/db/dataimport?command=full-import to do a full import.
  
  The ''solr'' directory is a MultiCore Solr home. It has two cores, one for the DB example (this one) and one for an RSS example (new feature).
  
@@ -190, +196 @@

      </document>
  </dataConfig>
  }}}
- 
  Here, the root entity is a table called "item" whose primary key is a column "id". Data can be read from this table with the query "select * from item". Each item can have multiple "features" which are in the table ''feature'' inside the column ''description''. Note the query in ''feature'' entity:
+ 
  {{{
     <entity name="feature" query="select description from feature where item_id='${item.id}'">
         <field name="feature" column="description" />
@@ -207, +213 @@

              </entity>
  }}}
  <<Anchor(shortconfig)>>
+ 
  === A shorter data-config ===
  In the above example, there are mappings of fields to Solr fields. It is possible to totally avoid the field entries in entities if the names of the fields are the same (case does not matter) as those in the Solr schema. You may need to add a field entry if any of the built-in Transformers are used (see the Transformer section).
  
  The shorter version is given below
+ 
  {{{
  <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
@@ -224, +232 @@

      </document>
  </dataConfig>
  }}}
- 
  == Using delta-import command ==
- Delta Import operation can be started by hitting the URL [[http://localhost:8983/solr/dataimport?command=delta-import]]. This operation will be started in a new thread and the ''status'' attribute in the response should be shown ''busy'' now. Depending on the size of your data set, this operation may take some time. At any time, you can hit [[http://localhost:8983/solr/dataimport]] to see the status flag.
+ Delta Import operation can be started by hitting the URL http://localhost:8983/solr/dataimport?command=delta-import. This operation will be started in a new thread and the ''status'' attribute in the response should be shown ''busy'' now. Depending on the size of your data set, this operation may take some time. At any time, you can hit http://localhost:8983/solr/dataimport to see the status flag.
  
  When delta-import command is executed, it reads the start time stored in ''conf/dataimport.properties''. It uses that timestamp to run delta queries and after completion, updates the timestamp in ''conf/dataimport.properties''.
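
  ''conf/dataimport.properties'' is a plain Java properties file; after an import it looks roughly like this (illustrative values):

  {{{
  #Wed Jul 20 10:15:32 CEST 2011
  last_index_time=2011-07-20 10\:15\:30
  item.last_index_time=2011-07-20 10\:15\:30
  }}}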
  
@@ -244, +251 @@

          <entity name="item" pk="ID"
                  query="select * from item"
                  deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'"
- 		deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
+                 deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
              <entity name="feature" pk="ITEM_ID"
                      query="select description as features from feature where item_id='${item.ID}'">
              </entity>
@@ -258, +265 @@

      </document>
  </dataConfig>
  }}}
+ Pay attention to the ''deltaQuery'' attribute which has an SQL statement capable of detecting changes in the ''item'' table. Note the variable {{{${dataimporter.last_index_time}}}}. The DataImportHandler exposes a variable called ''last_index_time'' which is a timestamp value denoting the last time ''full-import'' '''or''' ''delta-import'' was run. You can use this variable anywhere in the SQL you write in data-config.xml and it will be replaced by the value during processing.
- 
- Pay attention to the ''deltaQuery'' attribute which has an SQL statement capable of detecting changes in the ''item'' table. Note the variable {{{${dataimporter.last_index_time}}}}
- The DataImportHandler exposes a variable called ''last_index_time'' which is a timestamp value denoting the last time ''full-import'' ''''or'''' ''delta-import'' was run. You can use this variable anywhere in the SQL you write in data-config.xml and it will be replaced by the value during processing.
- 
  
  /!\ Note
+ 
   * The deltaQuery in the above example only detects changes in ''item'' but not in other tables. You can detect the changes to all child tables in one SQL query as specified below. Figuring out its details is an exercise for the user :)
+ 
  {{{
- 	deltaQuery="select id from item where id in
+         deltaQuery="select id from item where id in
- 				(select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
+                                 (select item_id as id from feature where last_modified > '${dataimporter.last_index_time}')
- 				or id in
+                                 or id in
- 				(select item_id as id from item_category where item_id in
+                                 (select item_id as id from item_category where item_id in
- 				    (select id as item_id from category where last_modified > '${dataimporter.last_index_time}')
+                                     (select id as item_id from category where last_modified > '${dataimporter.last_index_time}')
- 				or last_modified &gt; '${dataimporter.last_index_time}')
+                                 or last_modified &gt; '${dataimporter.last_index_time}')
- 				or last_modified &gt; '${dataimporter.last_index_time}'"
+                                 or last_modified &gt; '${dataimporter.last_index_time}'"
  }}}
   * Writing a huge deltaQuery like the above one is not a very enjoyable task, so we have an alternate mechanism of achieving this goal.
+ 
  {{{
  <dataConfig>
      <dataSource driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/temp/example/ex" user="sa" />
      <document>
- 	    <entity name="item" pk="ID" query="select * from item"
+             <entity name="item" pk="ID" query="select * from item"
                  deltaImportQuery="select * from item where ID=='${dataimporter.delta.id}'"
- 		deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
+                 deltaQuery="select id from item where last_modified &gt; '${dataimporter.last_index_time}'">
                  <entity name="feature" pk="ITEM_ID"
- 		    query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
+                     query="select DESCRIPTION as features from FEATURE where ITEM_ID='${item.ID}'"
- 		    deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
+                     deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
- 		    parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
+                     parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}"/>
  
  
- 	    <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
+             <entity name="item_category" pk="ITEM_ID, CATEGORY_ID"
- 		    query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
+                     query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
- 		    deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
+                     deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
- 		    parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
+                     parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
                  <entity name="category" pk="ID"
- 			query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'"
+                         query="select DESCRIPTION as cat from category where ID = '${item_category.CATEGORY_ID}'"
- 			deltaQuery="select ID from category where last_modified &gt; '${dataimporter.last_index_time}'"
+                         deltaQuery="select ID from category where last_modified &gt; '${dataimporter.last_index_time}'"
- 			parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
+                         parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}"/>
- 	    </entity>
+             </entity>
          </entity>
      </document>
  </dataConfig>
  }}}
- 
  Here we have three queries specified for each entity except the root (which has only two).
+ 
   * The ''query'' gives the data needed to populate fields of the Solr document in full-import
   * The ''deltaImportQuery'' gives the data needed to populate fields when running a delta-import
   * The ''deltaQuery'' gives the primary keys of the current entity which have changes since the last index time
   * The ''parentDeltaQuery'' uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.
  
  Let us reiterate the findings:
+ 
   * For each row given by ''query'', the query of the child entity is executed once.
   * For each row given by ''deltaQuery'', the parentDeltaQuery is executed.
   * If any row in the root/child entity changes, we regenerate the complete Solr document which contained that row.
  
+ /!\ Note :  The 'deltaImportQuery' is a Solr 1.4 feature. Originally it was generated automatically using the 'query' attribute which is error prone. /!\ Note : It is possible to do delta-import using a full-import command . [[http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta|See here]]
- /!\ Note :  The 'deltaImportQuery' is a Solr 1.4 feature. Originally it was generated automatically using the 'query' attribute which is error prone.
- /!\ Note : It is possible to do delta-import using a full-import command . [[http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta|See here]]
  
  <!> [[Solr3.1]] The handler checks to make sure that your declared primary key field is in the results of all queries.  In one instance, this required using an SQL alias when upgrading from 1.4 to 3.1, with a primary key field of "did":
+ 
-  {{{
+  . {{{
   deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataView}"
-  }}}
+ }}}
   Changed to:
   {{{
   deltaQuery="SELECT MAX(did) AS did FROM ${dataimporter.request.dataView}"
-  }}}
+ }}}
  
  = Usage with XML/HTTP Datasource =
  DataImportHandler can be used to index data from HTTP based data sources. This includes indexing data from REST/XML APIs as well as from RSS/ATOM feeds.
@@ -331, +339 @@

  <<Anchor(httpds)>>
  
  == Configuration of URLDataSource or HttpDataSource ==
- 
  <!> !HttpDataSource is being deprecated in favour of URLDataSource in [[Solr1.4]]
  
  Sample configurations for URLDataSource <!> [[Solr1.4]] and !HttpDataSource in data config xml look like this
+ 
  {{{
  <dataSource name="b" type="!HttpDataSource" baseUrl="http://host:port/" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
  <!-- or in Solr 1.4-->
@@ -348, +356 @@

   * '''`readTimeout`''' (optional): the default value is 10000ms
  
  == Configuration in data-config.xml ==
- 
  The entity for an xml/http data source can have the following attributes over and above the default attributes
+ 
   * '''`processor`''' (required) : The value must be `"XPathEntityProcessor"`
   * '''`url`''' (required) : The url used to invoke the REST API. (Can be templatized.) If the data source is a file, this must be the file location
   * '''`stream`''' (optional) : Set this to true if the xml is really big
@@ -358, +366 @@

   * '''`useSolrAddSchema`''' (optional): Set its value to 'true' if the xml that is fed into this processor has the same schema as that of the solr add xml. No need to mention any fields if it is set to true.
   * '''`flatten`''' (optional) : If this is set to true, text from under all the tags is extracted into one field, irrespective of the tag name. <!> [[Solr1.4]]
  
- 
  The entity fields can have the following attributes (over and above the default attributes):
+ 
   * '''`xpath`''' (optional) : The xpath expression of the field to be mapped as a column in the record. It can be omitted if the column does not come from the xml (i.e. it is a synthetic field created by a transformer). If a field is marked as multivalued in the schema and the xpath finds multiple values in a given row, it is handled automatically by the XPathEntityProcessor. No extra configuration is required
  
   * '''`commonField`''' : can be (true| false) . If true, this field once encountered in a record will be copied to other records before creating a Solr document
  
- If an API supports chunking (when the dataset is too large) multiple calls need to be made to complete the process.
- XPathEntityprocessor supports this with a transformer. If transformer returns a row which contains a field '''`$hasMore`''' with a the value `"true"` the Processor makes another request with the same url template (The actual value is recomputed before invoking ). A transformer can pass a totally new url too for the next call by returning a row which contains a field '''`$nextUrl`''' whose value must be the complete url for the next call.
+ If an API supports chunking (when the dataset is too large), multiple calls need to be made to complete the process. XPathEntityProcessor supports this with a transformer. If the transformer returns a row which contains a field '''`$hasMore`''' with the value `"true"`, the processor makes another request with the same url template (the actual value is recomputed before invoking). A transformer can also pass a totally new url for the next call by returning a row which contains a field '''`$nextUrl`''' whose value must be the complete url for the next call.
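
  As a sketch, a javascript transformer (see the ScriptTransformer section below) could drive such paging; the field name 'nextPage' is hypothetical:

  {{{
  <script><![CDATA[
      function paginate(row) {
          // keep fetching while the feed reports another page (hypothetical field 'nextPage')
          var next = row.get('nextPage');
          if (next != null) {
              row.put('$hasMore', 'true');
              row.put('$nextUrl', next);
          }
          return row;
      }
  ]]></script>
  ...
  <entity name="chunked" processor="XPathEntityProcessor" transformer="script:paginate" ...>
  }}}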
  
  The XPathEntityProcessor implements a streaming parser which supports a subset of xpath syntax. Complete xpath syntax is not supported but most of the common use cases are covered as follows:
+ 
  {{{
     xpath="/a/b/subject[@qualifier='fullTitle']"
     xpath="/a/b/subject/@qualifier"
@@ -375, +383 @@

     xpath="//a/..."
     xpath="/a//b..."
  }}}
- 
- 
  == HttpDataSource Example ==
  <!> !HttpDataSource is being deprecated in favour of URLDataSource in [[Solr1.4]]
  
  Download the full import example given in the DB section to try this out. We'll try indexing the [[http://rss.slashdot.org/Slashdot/slashdot|Slashdot RSS feed]] for this example.
  
- 
  The data-config for this example looks like this:
+ 
  {{{
  <dataConfig>
          <dataSource type="HttpDataSource" />
- 	<document>
- 		<entity name="slashdot"
- 			pk="link"
+         <document>
+                 <entity name="slashdot"
+                         pk="link"
- 			url="http://rss.slashdot.org/Slashdot/slashdot"
+                         url="http://rss.slashdot.org/Slashdot/slashdot"
- 			processor="XPathEntityProcessor"
- 			forEach="/RDF/channel | /RDF/item"
- 			transformer="DateFormatTransformer">
+                         processor="XPathEntityProcessor"
+                         forEach="/RDF/channel | /RDF/item"
+                         transformer="DateFormatTransformer">
  
- 			<field column="source"       xpath="/RDF/channel/title"   commonField="true" />
+                         <field column="source"       xpath="/RDF/channel/title"   commonField="true" />
- 			<field column="source-link"  xpath="/RDF/channel/link"    commonField="true" />
+                         <field column="source-link"  xpath="/RDF/channel/link"    commonField="true" />
- 			<field column="subject"      xpath="/RDF/channel/subject" commonField="true" />
+                         <field column="subject"      xpath="/RDF/channel/subject" commonField="true" />
  
- 			<field column="title"        xpath="/RDF/item/title" />
+                         <field column="title"        xpath="/RDF/item/title" />
- 			<field column="link"         xpath="/RDF/item/link" />
+                         <field column="link"         xpath="/RDF/item/link" />
- 			<field column="description"  xpath="/RDF/item/description" />
+                         <field column="description"  xpath="/RDF/item/description" />
- 			<field column="creator"      xpath="/RDF/item/creator" />
+                         <field column="creator"      xpath="/RDF/item/creator" />
- 			<field column="item-subject" xpath="/RDF/item/subject" />
+                         <field column="item-subject" xpath="/RDF/item/subject" />
  
- 			<field column="slash-department" xpath="/RDF/item/department" />
+                         <field column="slash-department" xpath="/RDF/item/department" />
- 			<field column="slash-section"    xpath="/RDF/item/section" />
+                         <field column="slash-section"    xpath="/RDF/item/section" />
- 			<field column="slash-comments"   xpath="/RDF/item/comments" />
+                         <field column="slash-comments"   xpath="/RDF/item/comments" />
- 			<field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
+                         <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
- 		</entity>
- 	</document>
+                 </entity>
+         </document>
  </dataConfig>
  }}}
- 
  This data-config is where the action is. If you read the structure of the Slashdot RSS, it has a few header elements such as title, link and subject. Those are mapped to the Solr fields source, source-link and subject respectively using xpath syntax. The feed also has multiple ''item'' elements which contain the actual news items. So, what we wish to do is to create a document in Solr for each 'item'.
  
  The XPathEntityprocessor is designed to stream the xml, row by row (think of a row as various fields in an xml element). It uses the ''forEach'' attribute to identify a 'row'. In this example forEach has the value `'/RDF/channel | /RDF/item'` . This says that this xml has two types of rows (this uses the xpath syntax for OR, and there can be more than one type of row). After it encounters a row, it tries to read as many fields as there are in the field declarations. So in this case, when it reads the row `'/RDF/channel'` it may get 3 fields 'source', 'source-link', 'subject'. After it processes the row it realizes that it does not have any value for the 'pk' field, so it does not try to create a Solr document for this row (even if it tried it would fail in solr). But all these 3 fields are marked as `commonField="true"`, so it keeps the values handy for subsequent rows.
@@ -427, +432 @@

  /!\ Note : Unlike with a database, it is not possible to omit the field declarations if you are using XPathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the xml.
  
  <<Anchor(wikipedia)>>
+ 
  == Example: Indexing wikipedia ==
  The following data-config.xml was used to index a full (en-articles, recent only) [[http://download.wikimedia.org/enwiki/20080724/|wikipedia dump]]. The file downloaded from wikipedia was the pages-articles.xml.bz2 which when uncompressed is around 18GB on disk.
  
@@ -454, +460 @@

  </dataConfig>
  }}}
  The relevant portion of schema.xml is below:
+ 
  {{{
  <field name="id"        type="integer" indexed="true" stored="true" required="true"/>
  <field name="title"     type="string"  indexed="true" stored="false"/>
@@ -467, +474 @@

  <uniqueKey>id</uniqueKey>
  <copyField source="title" dest="titleText"/>
  }}}
- 
  Time taken was around 2 hours 40 minutes to index 7278241 articles with peak memory usage at around 4GB. Note that many wikipedia articles are merely redirects to other articles; the use of $skipDoc <!> [[Solr1.4]] allows those articles to be ignored. Also, the column '''$skipDoc''' is only defined when the regexp matches.
  
  == Using delta-import command ==
- The only !EntityProcessor which supports delta is !SqlEntityProcessor! The XPathEntityProcessor has not implemented it yet. So, unfortunately, there is no delta support for XML at this time.
+ The only !EntityProcessor which supports delta is !SqlEntityProcessor! The XPathEntityProcessor has not implemented it yet. So, unfortunately, there is no delta support for XML at this time. If you want to implement those methods in XPathEntityProcessor: The methods are explained in !EntityProcessor.java.
- If you want to implement those methods in XPathEntityProcessor: The methods are explained in !EntityProcessor.java.
  
  = Indexing Emails =
  See MailEntityProcessor
  
  = Tika Integration =
- <!> [[Solr3.1]] [[TikaEntityProcessor]]
+ <!> [[Solr3.1]] TikaEntityProcessor
  
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to have all user needs met by an xml configuration alone. So we expose a few abstract classes which can be implemented by the user to enhance the functionality.
  
  <<Anchor(transformer)>>
+ 
  == Transformer ==
  Every set of fields fetched by the entity can either be consumed directly by the indexing process or be massaged using transformers to modify a field or create a totally new set of fields; a transformer can even return more than one row of data. The transformers must be configured on the entity level as follows.
+ 
  {{{
  <entity name="foo" transformer="com.foo.Foo" ... />
  }}}
@@ -495, +502 @@

  
  The entity transformer attribute can consist of a comma separated list of transformers (`say transformer="foo.X,foo.Y"`). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns is processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so the transformer does its thing! When all of the listed entity columns have been scanned the process is repeated using the next transformer in the list.
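
  A small hypothetical illustration of such chaining (a fuller example appears in the Transformers Example section below):

  {{{
  <entity name="person" transformer="RegexTransformer,TemplateTransformer"
          query="select full_name from person">
      <!-- RegexTransformer derives 'firstName' first, then TemplateTransformer uses it -->
      <field column="firstName" regex="(\w+)\s.*" sourceColName="full_name"/>
      <field column="greeting" template="Hello ${person.firstName}"/>
  </entity>
  }}}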
  
+ A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. If the action of the transformer fails, say a regex fails to match, then an existing field will be unaltered and an undefined field will remain undefined. The chaining effect described above allows a column's value to be altered again and again by successive transformers. A transformer may make use of other entity fields in the course of massaging a column's value.
- A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. If the action of the transformer fails, say a regex fails to match, then an
- existing field will be unaltered and an undefined field will remain undefined. The chaining effect described above allows a column's value to be altered again and again by successive transformers. A transformer may make use of other entity fields in the course of massaging a columns value.
- 
- 
  
  === RegexTransformer ===
- 
  There is a built-in transformer called '!RegexTransformer' provided with DIH. It helps in extracting or manipulating values from fields (from the source) using Regular Expressions. The actual class name is `org.apache.solr.handler.dataimport.RegexTransformer`. But as it belongs to the default package the package-name can be omitted.
  
- 
- 
  '''Attributes'''
  
  !RegexTransformer is only activated for fields with an attribute of 'regex' or 'splitBy'. Other fields are ignored.
+ 
   * '''`regex`''' : The regular expression that is used to match against the column or sourceColName's value(s). If `replaceWith` is absent, each regex ''group'' is taken as a value and a list of values is returned
   * '''`sourceColName`''' : The column on which the regex is to be applied. If this is absent source and target are same
   * '''`splitBy`''' : Used to split a String to obtain multiple values, returns a list of values
@@ -516, +518 @@

   * '''`replaceWith`''' : Used along with `regex` . It is equivalent to the method `new String(<sourceColVal>).replaceAll(<regex>, <replaceWith>)`
  
  example:
+ 
  {{{
  <entity name="foo" transformer="RegexTransformer"
  query="select full_name , emailids from foo"/>
@@ -529, +532 @@

     <field column="mailId" splitBy="," sourceColName="emailids"/>
  </entity>
  }}}
- 
  In this example the attributes 'regex' and 'sourceColName' are custom attributes used by the transformer. It reads the field 'full_name' from the resultset and transforms it to two new target fields 'firstName' and 'lastName'. So even though the query returned only one column 'full_name' in the resultset the solr document gets two extra fields 'firstName' and 'lastName' which are 'derived' fields. These new fields are only created if the regexp matches.
  
  The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more email ids, and we expect 'mailId' to be a multivalued field in Solr.
  
  <!> Note that this transformer can either be used to split a string into tokens based on a '''`splitBy`''' pattern, or to perform a string substitution as per '''`replaceWith`''', or it can assign groups within a pattern to a list of '''`groupNames`'''. It decides what to do based upon the above attributes '''`splitBy`''', '''`replaceWith`''' and '''`groupNames`''', which are looked for in order. The first one found is acted upon and other unrelated attributes are ignored.
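
  A hedged sketch of how the '''`groupNames`''' variant might be configured (column names invented; it assumes the regex is applied to the field's own column and each group is assigned to the correspondingly positioned name):

  {{{
  <entity name="foo" transformer="RegexTransformer" query="select full_name from foo">
      <field column="full_name" groupNames="firstName,lastName" regex="(\w+)\s+(\w+)"/>
  </entity>
  }}}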
- 
  
  === ScriptTransformer ===
  It is possible to write transformers in Javascript or any other scripting language supported by Java. You must use '''Java 6''' to use this feature.
  
  {{{
  <dataConfig>
- 	<script><![CDATA[
+         <script><![CDATA[
- 		function f1(row)	{
+                 function f1(row)        {
- 		    row.put('message', 'Hello World!');
+                     row.put('message', 'Hello World!');
- 		    return row;
- 		}
- 	]]></script>
- 	<document>
+                     return row;
+                 }
+         ]]></script>
+         <document>
- 		<entity name="e" pk="id" transformer="script:f1" query="select * from X">
+                 <entity name="e" pk="id" transformer="script:f1" query="select * from X">
                  ....
                  </entity>
          </document>
  </dataConfig>
  }}}
- 
  Another more complex example
+ 
  {{{
  <dataConfig>
- 	<script><![CDATA[
+         <script><![CDATA[
- 		function CategoryPieces(row)	{
+                 function CategoryPieces(row)    {
- 		    var pieces = row.get('category').split('/');
+                     var pieces = row.get('category').split('/');
                      var arr = new java.util.ArrayList();
- 		    for (var i=0; i<pieces.length; i++) {
+                     for (var i=0; i<pieces.length; i++) {
                         arr.add(pieces[i]);
                      }
                      row.put('categorypieces', arr);
                      row.remove('category');
                      return row;
- 		}
- 	]]></script>
- 	<document>
+                 }
+         ]]></script>
+         <document>
- 		<entity name="e" pk="id" transformer="script:CategoryPieces" query="select * from X">
+                 <entity name="e" pk="id" transformer="script:CategoryPieces" query="select * from X">
                  ....
                  </entity>
          </document>
  </dataConfig>
  }}}
- 
   * You can put a script tag inside the ''dataConfig'' node. By default, the language is assumed to be Javascript. In case you're using another language, specify it on the script tag with the attribute `'language="MyLanguage"'` (must be supported by java 6)
   * Write as many transformer functions as you want to use. Each such function must accept a ''row'' variable corresponding to ''Map<String, Object>'' and return a row (after applying transformations)
   * To remove entries from the row use row.remove(keyname);
@@ -590, +590 @@

   * The semantics of execution are the same as that of a java transformer. The method can have two arguments, as in 'transformRow(Map<String,Object> row, Context context)' in the abstract class 'Transformer'. As it is javascript, the second argument may be omitted and it still works.
  
  <<Anchor(DateFormatTransformer)>>
+ 
  === DateFormatTransformer ===
  There is a built-in transformer called the !DateFormatTransformer which is useful for parsing date/time strings into java.util.Date instances.
  
  {{{
  <field column="date" xpath="/RDF/item/date" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss" />
  }}}
- 
  '''Attributes'''
  
  !DateFormatTransformer applies only on the fields with an attribute 'dateTimeFormat'. All other fields are left as they are.
+ 
   * '''`dateTimeFormat`''' : The format used for parsing this field. This must comply with the syntax of java [[http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html|SimpleDateFormat]].
   * '''`sourceColName`''' : The column on which the dateFormat is to be applied. If this is absent source and target are same
  
  The above field definition is used in the RSS example to parse the publish date of the RSS feed item.
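
  A hypothetical example using '''`sourceColName`''', so that the parsed date lands in a different column than the raw string:

  {{{
  <field column="published_dt" sourceColName="published_raw" dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
  }}}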
  
  === NumberFormatTransformer ===
- Can be used to parse a number from a String. Uses the !NumberFormat class in java
+ Can be used to parse a number from a String. Uses the !NumberFormat class in java, e.g.:
- eg:
+ 
  {{{
  <field column="price" formatStyle="number" />
  }}}
- 
  By default, !NumberFormat uses the system's default locale to parse the given string. If you want to specify a different locale then you can specify it as an attribute. e.g.
+ 
  {{{
  <field column="price" formatStyle="number" locale="de-DE" />
  }}}
- 
  '''Attributes'''
  
  !NumberFormatTransformer applies only on the fields with an attribute 'formatStyle' .
+ 
   * '''`formatStyle`''' : The format used for parsing this field. The value of the attribute must be one of (number|percent|integer|currency). This uses the semantics of java [[http://java.sun.com/j2se/1.4.2/docs/api/java/text/NumberFormat.html|NumberFormat]].
   * '''`sourceColName`''' : The column on which the !NumberFormat is to be applied. If this is absent, source and target are same.
   * '''`locale`''' : The locale to be used for parsing the strings. If this is absent, the system's default locale is used. It must be specified as language-country. For example en-US.
  
  === TemplateTransformer ===
- Can be used to overwrite or modify any existing Solr field or to create new Solr fields. The value assigned to the field is based on a static template string, which can contain DIH variables. If a template string contains placeholders or variables they must be defined when the transformer is being evaluated. An undefined variable causes the entire template instruction to be ignored.
+ Can be used to overwrite or modify any existing Solr field or to create new Solr fields. The value assigned to the field is based on a static template string, which can contain DIH variables. If a template string contains placeholders or variables they must be defined when the transformer is being evaluated. An undefined variable causes the entire template instruction to be ignored. eg:
- eg:
+ 
  {{{
  <entity name="e" transformer="TemplateTransformer" ..>
  <field column="namedesc" template="hello${e.name},${eparent.surname}" />
@@ -642, +643 @@

  === HTMLStripTransformer ===
  <!> [[Solr1.4]]
  
- Can be used to strip HTML out of a string field
+ Can be used to strip HTML out of a string field e.g.:
- e.g.:
+ 
  {{{
  <entity name="e" transformer="HTMLStripTransformer" ..>
  <field column="htmlText" stripHTML="true" />
@@ -659, +660 @@

  === ClobTransformer ===
  <!> [[Solr1.4]]
  
- Can be used to create a String out of a Clob type in database.
+ Can be used to create a String out of a Clob type in database. e.g.:
- e.g.:
+ 
  {{{
  <entity name="e" transformer="ClobTransformer" ..>
  <field column="hugeTextField" clob="true" />
  ...
  </entity>
  }}}
- 
  '''Attributes'''
  
   * '''`clob`''' : Boolean value to signal if !ClobTransformer should process this field or not.
@@ -676, +676 @@

  === LogTransformer ===
  <!> [[Solr1.4]]
  
- Can be used to Log data to console/logs.
+ Can be used to log data to the console/logs, e.g.:
- e.g.:
+ 
  {{{
  <entity ...
  transformer="LogTransformer"
  logTemplate="The name is ${e.name}" logLevel="debug" >
  ....
- </entity>}}}
+ </entity>
- 
+ }}}
  Unlike other transformers, this does not apply to any particular field, so the attributes are specified on the entity itself.
  
+ Valid logLevels are:
+ 
+  1. trace
+  1. debug
+  1. info
+  1. warn
+  1. error
+ 
+ which have to be specified in lowercase (the value is case-sensitive).
  
  <<Anchor(example-transformers)>>
+ 
  === Transformers Example ===
- 
  <!> [[Solr1.4]] The following example shows transformer chaining in action along with extensive reuse of variables. An invariant is defined in the solrconfig.xml and reused within some transforms. Column names from both entities are also used in transforms.
  
  Imagine we have XML documents, each of which describes a set of images. The images are stored in an images subdirectory of the XML document. An attribute storing an image's filename is accompanied by a brief caption and a relative link to another document holding a longer description of the image. Finally, the image name, when prefixed with an 's', links to a smaller icon-sized version of the image, which is always a png. We want SOLR to store fields containing the absolute link to the image, its icon and the full description. The following shows one way we could configure solrconfig.xml and DIH's data-config.xml to index this data.
@@ -707, +716 @@

         </lst>
      </requestHandler>
  }}}
- 
- 
  {{{
   <dataConfig>
   <dataSource name="myfilereader" type="FileDataSource"/>
     <document>
       <entity name="jc" rootEntity="false" dataSource="null"
- 	     processor="FileListEntityProcessor"
+              processor="FileListEntityProcessor"
- 	     fileName="^.*\.xml$" recursive="true"
+              fileName="^.*\.xml$" recursive="true"
- 	     baseDir="/usr/local/apache2/htdocs/imagery"
+              baseDir="/usr/local/apache2/htdocs/imagery"
               >
         <entity name="x"rootEntity="true"
- 	       dataSource="myfilereader"
+                dataSource="myfilereader"
- 	       processor="XPathEntityProcessor"
+                processor="XPathEntityProcessor"
- 	       url="${jc.fileAbsolutePath}"
+                url="${jc.fileAbsolutePath}"
- 	       stream="false" forEach="/mediaBlock"
+                stream="false" forEach="/mediaBlock"
- 	       transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,LogTransformer"
+                transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,LogTransformer"
                 logTemplate="      processing ${jc.fileAbsolutePath}"
                 logLevel="info"
                 >
@@ -752, +759 @@

     </document>
    </dataConfig>
  }}}
- 
  <<Anchor(custom-transformers)>>
+ 
  === Writing Custom Transformers ===
  It is simple to add your own transformers and this is documented on the page [[DIHCustomTransformer]]
  
  <<Anchor(entityprocessor)>>
+ 
  == EntityProcessor ==
- Each entity is handled by a default Entity processor called !SqlEntityProcessor. This works well for systems which use RDBMS as a datasource. For other kind of datasources like  REST or Non Sql datasources you can choose to extend this abstract class `org.apache.solr.handler.dataimport.Entityprocessor`. This is designed to Stream rows one by one from an entity. The simplest way to implement your own !EntityProcessor is to extend !EntityProcessorBase and override the `public Map<String,Object> nextRow()` method.
+ Each entity is handled by a default entity processor called !SqlEntityProcessor. This works well for systems which use an RDBMS as a datasource. For other kinds of datasources, like REST or non-SQL datasources, you can choose to extend the abstract class `org.apache.solr.handler.dataimport.EntityProcessor`. This is designed to stream rows one by one from an entity. The simplest way to implement your own !EntityProcessor is to extend !EntityProcessorBase and override the `public Map<String,Object> nextRow()` method (a minimal sketch follows below). An '!EntityProcessor' relies on the !DataSource for fetching data. The return type of the !DataSource is important for an !EntityProcessor. The built-in ones are described in the following sections.
- '!EntityProcessor' rely on the !DataSource for fetching data. The return type of the !DataSource is important for an !EntityProcessor. The built-in ones are,
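
  A minimal, untested sketch of such a custom processor (class and field names are made up; it emits one synthetic row and then signals end-of-data by returning null):

  {{{
  package com.example.dih;

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.EntityProcessorBase;

  public class SingleRowEntityProcessor extends EntityProcessorBase {
      private boolean done;

      @Override
      public void init(Context context) {
          super.init(context);
          done = false;                 // reset for every run of the entity
      }

      @Override
      public Map<String, Object> nextRow() {
          if (done) return null;        // returning null ends this entity's row stream
          done = true;
          Map<String, Object> row = new HashMap<String, Object>();
          row.put("message", "hello from a custom EntityProcessor");
          return row;
      }
  }
  }}}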
  
  === SqlEntityProcessor ===
  This is the default. The !DataSource must be of type `DataSource<Iterator<Map<String, Object>>>`. !JdbcDataSource can be used with this.
@@ -770, +777 @@

  
  === FileListEntityProcessor ===
  A simple entity processor which can be used to enumerate the list of files from a File System based on some criteria. It does not use a !DataSource. The entity attributes are:
+ 
   * '''`fileName`''' :(required) A regex pattern to identify files
   * '''`baseDir`''' : (required) The Base directory (absolute path)
   * '''`recursive`''' : Recursive listing or not. Default is 'false'
@@ -782, +790 @@

   * '''`dataSource`''' : If you use Solr1.3 it must be set to "null" because this entity does not use any !DataSource. There is no need to specify that in Solr1.4. It just means that we won't create a !DataSource instance. (In most cases there is only one !DataSource (a !JdbcDataSource) and all entities just use it. In the case of !FileListEntityProcessor a !DataSource is not necessary.)
  
  example:
+ 
  {{{
  <dataConfig>
      <dataSource type="FileDataSource" />
@@ -802, +811 @@

  This is an extension of the !SqlEntityProcessor. This !EntityProcessor helps reduce the number of DB queries executed by caching the rows. It does not help to use it in the root-most entity because only one query is run for that entity.
  
  Example 1.
+ 
  {{{
  <entity name="x" query="select * from x">
      <entity name="y" query="select * from y where xid=${x.id}" processor="CachedSqlEntityProcessor">
      </entity>
  </entity>
  }}}
- 
  The usage is exactly the same as the other one. When a query is run the results are stored, and if the same query is run again the results are fetched from the cache and returned.
  
  Example 2:
+ 
  {{{
  <entity name="x" query="select * from x">
      <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor"  where="xid=x.id">
      </entity>
  </entity>
  }}}
- 
  The difference with the previous one is the 'where' attribute. In this case the query fetches all the rows from the table and stores all the rows in the cache. The magic is in the 'where' value. The cache stores the values with the 'xid' value in 'y' as the key. The value for 'x.id' is evaluated every time the entity has to be run, the value is looked up in the cache and the matching rows are returned.
  
  In the 'where' attribute, the LHS (the part before '=') is the column in y and the RHS (the part after '=') is the value to be computed for looking up the cache.
+ 
  ----
- For more caching options with DIH see [[https://issues.apache.org/jira/browse/SOLR-2382|https://issues.apache.org/jira/browse/SOLR-2382]].  These additional options include:  using caches with non-sql entities, pluggable cache implementations, persistent caches, writing DIH output to a cache rather than directly to solr, using a previously-created cache as a DIH entity's input & delta updates on cached data.
+ For more caching options with DIH see https://issues.apache.org/jira/browse/SOLR-2382.  These additional options include:  using caches with non-sql entities, pluggable cache implementations, persistent caches, writing DIH output to a cache rather than directly to solr, using a previously-created cache as a DIH entity's input & delta updates on cached data.
+ 
  ----
  === PlainTextEntityProcessor ===
+ <<Anchor(plaintext)>> <!> [[Solr1.4]]
- <<Anchor(plaintext)>>
- <!> [[Solr1.4]]
  
  This !EntityProcessor reads all content from the data source into a single implicit field called 'plainText'. The content is not parsed in any way, however you may add transformers to manipulate the data within 'plainText' as needed or to create other additional fields.
  
  example:
+ 
  {{{
  <entity processor="PlainTextEntityProcessor" name="x" url="http://abc.com/a.txt" dataSource="data-source-name">
     <!-- copies the text to a field called 'text' in Solr-->
    <field column="plainText" name="text"/>
  </entity>
  }}}
- 
  Ensure that the dataSource is of type !DataSource<Reader> (!FileDataSource, URLDataSource)
  
  === LineEntityProcessor ===
- <<Anchor(LineEntityProcessor)>>
+ <<Anchor(LineEntityProcessor)>> <!> [[Solr1.4]]
- <!> [[Solr1.4]]
  
  This !EntityProcessor reads all content from the data source on a line by line basis; a field called 'rawLine' is returned for each line read. The content is not parsed in any way, however you may add transformers to manipulate the data within 'rawLine' or to create other additional fields.
  
- The lines read can be filtered by two regular expressions '''acceptLineRegex''' and '''omitLineRegex'''.
+ The lines read can be filtered by two regular expressions '''acceptLineRegex''' and '''omitLineRegex'''. This entity's additional attributes are:
- This entities additional attributes are:
+ 
   * '''`url`''' : a required attribute that specifies the location of the input file in a way that is compatible with the configured datasource. If this value is relative and you are using !FileDataSource or URLDataSource, it is assumed to be relative to '''baseLoc'''.
   * '''`acceptLineRegex`''' : an optional attribute that, if present, discards any line which does not match the regExp.
   * '''`omitLineRegex`''' : an optional attribute that is applied after any acceptLineRegex and discards any line which matches this regExp.
+ 
  example:
+ 
  {{{
  <entity name="jc"
          processor="LineEntityProcessor"
@@ -865, +876 @@

          >
     ...
  }}}
- While there are use cases where you might need to create a solr document per line read from a file, it is expected that in most cases that the lines read will consist of a pathname which is in turn consumed by another !EntityProcessor
+ While there are use cases where you might need to create a solr document per line read from a file, it is expected that in most cases that the lines read will consist of a pathname which is in turn consumed by another !EntityProcessor such as XPathEntityProcessor.
- such as XPathEntityProcessor.
+ 
  ----
- See [[https://issues.apache.org/jira/browse/SOLR-2549|https://issues.apache.org/jira/browse/SOLR-2549]] for a patch that extends LineEntityProcessor to support fixed-width and delimited files without needing to use a Transformer.
+ See https://issues.apache.org/jira/browse/SOLR-2549 for a patch that extends LineEntityProcessor to support fixed-width and delimited files without needing to use a Transformer.
+ 
  ----
  
  == DataSource ==
- <<Anchor(datasource)>>
- A class can extend `org.apache.solr.handler.dataimport.DataSource` . [[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See source]]
+ <<Anchor(datasource)>> A class can extend `org.apache.solr.handler.dataimport.DataSource` . [[http://svn.apache.org/viewvc/lucene/dev/trunk/solr/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/DataSource.java?view=markup|See source]]
  
  and can be used as a !DataSource. It must be configured in the dataSource definition
+ 
  {{{
  <dataSource type="com.foo.FooDataSource" prop1="hello"/>
  }}}
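
  A minimal, untested sketch of such a class (names are made up; it ignores the query and always returns a fixed string wrapped in a Reader):

  {{{
  package com.foo;

  import java.io.Reader;
  import java.io.StringReader;
  import java.util.Properties;

  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.DataSource;

  public class FooDataSource extends DataSource<Reader> {
      private String prop1;

      @Override
      public void init(Context context, Properties initProps) {
          // picks up the 'prop1' attribute from the <dataSource> tag shown above
          prop1 = initProps.getProperty("prop1", "");
      }

      @Override
      public Reader getData(String query) {
          // a real implementation would fetch the data identified by 'query'
          return new StringReader("value of prop1: " + prop1);
      }

      @Override
      public void close() {
          // release connections/handles here if any were opened
      }
  }
  }}}
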
@@ -883, +895 @@

  
  === JdbcdataSource ===
  This is the default. See the  [[#jdbcdatasource|example]] . The signature is as follows
+ 
  {{{
  public class JdbcDataSource extends DataSource<Iterator<Map<String, Object>>>
  }}}
  It is designed to iterate rows in DB one by one. A row is represented as a Map.
  
  === URLDataSource ===
- <!> [[Solr1.4]]
- This datasource is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location. See the documentation [[#httpds|here]] . The signature is as follows
+ <!> [[Solr1.4]] This datasource is often used with XPathEntityProcessor to fetch content from an underlying file:// or http:// location. See the documentation [[#httpds|here]] . The signature is as follows
+ 
  {{{
  public class URLDataSource extends DataSource<Reader>
  }}}
- 
  === HttpDataSource ===
  <!> !HttpDataSource is being deprecated in favour of URLDataSource in [[Solr1.4]]. There is no change in functionality between URLDataSource and !HttpDataSource, only a name change.
  
  === FileDataSource ===
  This can be used like a URLDataSource, but is used to fetch content from files on disk. The only difference from URLDataSource, when accessing disk files, is how a pathname is specified. The signature is as follows
+ 
  {{{
  public class FileDataSource extends DataSource<Reader>
  }}}
- 
  The attributes are:
+ 
   * '''`basePath`''': (optional) The base path relative to which the value is evaluated if it is not absolute
   * '''`encoding`''': (optional) If the files are to be read in an encoding that is not same as the platform encoding
  
@@ -912, +925 @@

  <!> [[Solr1.4]]
  
  This can be used like an URLDataSource . The signature is as follows
+ 
  {{{
  public class FieldReaderDataSource extends DataSource<Reader>
  }}}
- This can be useful for users who have a DB field containing XML and wish to use a nested XPathEntityProcessor to process the fields contents.
+ This can be useful for users who have a DB field containing XML and wish to use a nested XPathEntityProcessor to process the field's contents. The datasource may be configured as follows
- The datasouce may be configured as follows
+ 
  {{{
    <dataSource name="f" type="FieldReaderDataSource" />
  }}}
- 
  The entity which uses this datasource must specify the field to read via the dataField attribute, dataField="field-name". For instance, if the parent entity 'dbEntity' has a field called 'xmlData', then the child entity would look like:
+ 
  {{{
  <entity dataSource="f" processor="XPathEntityProcessor" dataField="dbEntity.xmlData"/>
  }}}
- 
  === ContentStreamDataSource ===
  <!> [[Solr1.4]]
  
@@ -941, +954 @@

  </document>
  </dataConfig>
  }}}
- 
  == Special Commands ==
  Special commands can be given to DIH by adding certain variables to the row returned by any of the components.
+ 
   * '''`$skipDoc`''' : Skip the current document . Do not add it to Solr. The value can be String true/false
   * '''`$skipRow`''' : Skip the current row. The document will be added with rows from other entities. The value can be String true/false
   * '''`$docBoost`''' : Boost the current doc. The value can be a number or the toString of a number
   * '''`$deleteDocById`''' : Delete a doc from Solr with this id. The value has to be the uniqueKey value of the document. Note that this command can only delete docs already committed to the index. <!> [[Solr1.4]]
   * '''`$deleteDocByQuery`''' : Delete docs from Solr by this query. The value must be a Solr Query <!> [[Solr1.4]]
+ 
- Note: prior to Solr 3.4, $deleteDocById and $deleteDocByQuery do not increment the "# deletes processed" statistic.  Also, if a component ''only'' deletes documents using these special commands, DIH will not commit the changes.  With Solr 3.4 and later, "commit" is always called as expected and the "# deletes processed" statistic is incremented by 1 for each call to $deleteDocById and/or $deleteDocByQuery.  This may not accurately reflect the actual number of documents deleted as these commands (especially $deleteDocByQuery) can delete more than 1 document (or no documents) per call.  See [[https://issues.apache.org/jira/browse/SOLR-2492|https://issues.apache.org/jira/browse/SOLR-2492]] for a more information.
+ Note: prior to Solr 3.4, $deleteDocById and $deleteDocByQuery do not increment the "# deletes processed" statistic.  Also, if a component ''only'' deletes documents using these special commands, DIH will not commit the changes.  With Solr 3.4 and later, "commit" is always called as expected and the "# deletes processed" statistic is incremented by 1 for each call to $deleteDocById and/or $deleteDocByQuery.  This may not accurately reflect the actual number of documents deleted as these commands (especially $deleteDocByQuery) can delete more than 1 document (or no documents) per call.  See https://issues.apache.org/jira/browse/SOLR-2492 for more information.
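  
  For illustration, here is a hypothetical custom transformer that uses two of these commands; the column names 'active', 'deleted' and 'id' are made up, and the class would be attached to an entity via transformer="com.foo.StatusTransformer":
  
  {{{
  package com.foo;
  
  import java.util.Map;
  
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;
  
  public class StatusTransformer extends Transformer {
      public Object transformRow(Map<String, Object> row, Context context) {
          if ("N".equals(row.get("active"))) {
              // do not index this document at all
              row.put("$skipDoc", "true");
          }
          if ("Y".equals(row.get("deleted"))) {
              // instead, remove the previously indexed document with this uniqueKey value
              row.put("$deleteDocById", row.get("id"));
          }
          return row;
      }
  }
  }}}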
  
  == Adding datasource in solrconfig.xml ==
  <<Anchor(solrconfigdatasource)>>
  
  It is possible to configure the datasource in solrconfig.xml as well as in data-config.xml; however, the datasource attributes are expressed differently.
+ 
  {{{
    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
@@ -969, +984 @@

    </requestHandler>
  }}}
  <<Anchor(arch)>>
+ 
  = Architecture =
  The following diagram describes the logical flow for a sample configuration.
  
  {{attachment:DataImportHandlerOverview.png}}
  
- The use case is as follows:
- There are 3 datasources two RDBMS (jdbc1,jdbc2) and one xml/http (B)
+ The use case is as follows: There are 3 datasources: two RDBMS (jdbc1, jdbc2) and one xml/http (B)
  
   * `jdbc1` and `jdbc2` are instances of  type `JdbcDataSource` which are configured in the solrconfig.xml.
   * `http` is an instance of type `HttpDataSource`
@@ -984, +999 @@

   * On doing a `command=full-import` The root-entity (A) is executed first
   * Each row emitted by the 'query' in entity 'A' is fed into its sub-entities B and C
   * The queries in B and C use a column in 'A' to construct their queries using placeholders like `${A.a}`
-    * B has a url  (B is an xml/http datasource)
+   * B has a url  (B is an xml/http datasource)
-    * C has a query
+   * C has a query
   * C has two transformers ('f' and 'g' )
   * Each row that comes out of C is fed into 'f' and 'g' sequentially (transformers are chained). Each transformer can change the input. Note that the transformer 'g' produces 2 output rows for an input row `f(C.1)`
   * The end output of each entity is combined together to construct a document
-    * Note that the intermediate rows from C i.e `C.1, C.2, f(C.1) , f(C1)` are ignored
+   * Note that the intermediate rows from C, i.e. `C.1, C.2, f(C.1), f(C.2)`, are ignored
+ 
  == Field declarations ==
  Fields declared in the <entity> tags provide extra information which cannot be derived automatically. The tool relies on the 'column' values to fetch values from the results. A field you explicitly add in the configuration corresponds to a field present in the Solr schema.xml (implicit fields) and automatically inherits all the attributes defined there; you just cannot add extra configuration here. Add explicit field entries when (see the short example after this list):
+ 
   * The fields emitted by the !EntityProcessor have a different name than the fields in schema.xml
   * Built-in transformers expect extra information to decide which fields to process and how to process
   * XPathEntityProcessor or any other processor which explicitly demands extra information in each field
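  
  For example, if the query emits a column ITEM_NAME but the schema field is called 'name', an explicit field entry maps one to the other (the column and table names here are illustrative):
  
  {{{
  <entity name="item" query="select ID, ITEM_NAME from item">
    <field column="ID" name="id"/>
    <field column="ITEM_NAME" name="name"/>
  </entity>
  }}}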
+ 
  == What is a row? ==
  A row in !DataImportHandler is a Map (Map<String, Object>). In the map, the key is the name of the field and the value can be anything which is a valid Solr type. The value can also be a Collection of valid Solr types (this may get mapped to a multi-valued field). If the !DataSource is an RDBMS, a query cannot emit a multivalued field. But it is possible to create a multivalued field by joining an entity with another, i.e. if the sub-entity returns multiple rows for one row from the parent entity, they can go into a multivalued field. If the datasource is xml, it is possible to return a multivalued field.
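  
  For example, a sub-entity that returns several rows per parent row can populate a multivalued field; the table and column names below are illustrative, and 'tag' would need to be declared multiValued in schema.xml:
  
  {{{
  <entity name="item" query="select id, name from item">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
    <entity name="item_tag" query="select tag from item_tag where item_id='${item.id}'">
      <field column="tag" name="tag"/>
    </entity>
  </entity>
  }}}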
  
  == A VariableResolver ==
  A !VariableResolver is the component which replaces all those placeholders such as `${<name>}`. It is a multilevel Map. Each namespace is a Map and namespaces are separated by periods (.), e.g. if there is a placeholder ${item.ID}, 'item' is a namespace (which is a map) and 'ID' is a value in that namespace. It is possible to nest namespaces like ${item.x.ID} where x could be another Map. A reference to the current !VariableResolver can be obtained from the Context, or the object can be consumed directly by using ${<name>} in 'query' for RDBMS queries or in 'url' for Http.
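  
  As a sketch of the programmatic route (a made-up transformer, shown only to illustrate obtaining the resolver from the Context):
  
  {{{
  package com.foo;
  
  import java.util.Map;
  
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;
  import org.apache.solr.handler.dataimport.VariableResolver;
  
  public class ResolveExampleTransformer extends Transformer {
      public Object transformRow(Map<String, Object> row, Context context) {
          VariableResolver resolver = context.getVariableResolver();
          // resolves to the same value that ${item.ID} would give in a query or url
          Object id = resolver.resolve("item.ID");
          row.put("resolvedId", id);
          return row;
      }
  }
  }}}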
+ 
  === Custom formatting in query and url using Functions ===
  While the namespace concept is useful, the user may want to put some computed value into the query or url, for example when there is a Date object and your datasource accepts dates in some custom format. The !DataImportHandler provides a few functions which can do some of these (a short usage sketch follows the list).
+ 
   * ''formatDate'' : It is used like this `'${dataimporter.functions.formatDate(item.ID, 'yyyy-MM-dd HH:mm')}'`. The first argument can be a valid value from the !VariableResolver and the second argument can be a format string (using !SimpleDateFormat). The first argument can also be a computed value, e.g. `'${dataimporter.functions.formatDate('NOW-3DAYS', 'yyyy-MM-dd HH:mm')}'`, which uses the syntax of the datemath parser in Solr (note that it must be enclosed in single quotes). <!> Note: this syntax has been changed in 1.4. The second parameter was not enclosed in single quotes earlier, but it will continue to work without the single quotes as well.
   * ''escapeSql'' : Use this to escape special sql characters, e.g. `'${dataimporter.functions.escapeSql(item.ID)}'`. Takes only one argument and it must be a valid value in the !VariableResolver.
   * ''encodeUrl'' : Use this to encode urls, e.g. `'${dataimporter.functions.encodeUrl(item.ID)}'`. Takes only one argument and it must be a valid value in the !VariableResolver
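  
  A short usage sketch, assuming a table 'item' with a 'last_modified' column:
  
  {{{
  <entity name="item"
          query="select * from item
                 where last_modified &gt; '${dataimporter.functions.formatDate('NOW-3DAYS', 'yyyy-MM-dd HH:mm')}'"/>
  }}}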
  
  ==== Custom Functions ====
  It is possible to plug custom functions into DIH. Implement an [[http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/Evaluator.html|Evaluator]] and specify it in the data-config.xml. Following is an example of an evaluator which does a 'toLowerCase' on a String.
+ 
  {{{
  <dataConfig>
    <function name="toLowerCase" class="foo.LowerCaseFunctionEvaluator"/>
@@ -1018, +1039 @@

    </document>
  </dataConfig>
  }}}
- 
  The implementation of !LowerCaseFunctionEvaluator
+ 
  {{{
    public class LowerCaseFunctionEvaluator extends Evaluator{
      public String evaluate(String expression, Context context) {
@@ -1034, +1055 @@

  
    }
  }}}
- 
  === Accessing request parameters ===
  All HTTP request parameters sent to Solr when using the dataimporter can be accessed using the 'request' namespace, e.g. `'${dataimporter.request.command}'` will return the command that was run.
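  
  For example, an extra request parameter (the name 'category' is made up here) can be referenced in a query:
  
  {{{
  http://localhost:8983/solr/dataimport?command=full-import&category=books
  
  <entity name="item" query="select * from item where category='${dataimporter.request.category}'"/>
  }}}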
  
  <<Anchor(interactive)>>
+ 
  = Interactive Development Mode =
  This is a cool and powerful new feature in the tool. It helps you build a data-config.xml with the UI. It can be accessed from http://host:port/solr/admin/dataimport.jsp . The features are:
+ 
   * A UI with two panels. The RHS takes in the input and the LHS shows the output
   * When you hit the button 'debug now' it runs the configuration and shows the documents created
   * You can configure the start and rows parameters to debug documents, say, 115 to 118.
@@ -1052, +1074 @@

  {{attachment:interactive-dev-dataimporthandler.PNG}}
  
  <<BR>>
+ 
  ----
- 
  <<Anchor(scheduling)>>
+ 
  = Scheduling =
  {i}
+ 
-  * Data``Import``Scheduler
+  * DataImportScheduler
   * Version 1.2
   * Last revision: 20.09.2010.
   * Author: Marko Bonaci
@@ -1068, +1092 @@

   * Successfully tested on ''Apache Tomcat v6'' (should work on any other servlet container)
   * Hasn't been committed to SVN (published only here)
  
+ <<BR>> <!> TODO:
- <<BR>>
- <!> TODO:
  
-  * enable user to create multiple scheduled tasks (List<Data``Import``Scheduler>)
+  * enable user to create multiple scheduled tasks (List<DataImportScheduler>)
   * add ''cancel'' functionality (to be able to completely disable ''DIHScheduler'' background thread, without stopping the app/server). Currently, sync can be disabled by setting  ''syncEnabled'' param to anything other than "1" in ''dataimport.properties'', but the background thread still remains active and reloads the properties file on every run (so that sync can be hot-redeployed)
   * try to use Solr's classes wherever possible
   * add javadoc style comments
  
  <<BR>>
+ 
  == Prereqs ==
-  {1} working DIH configuration in place <<BR>>
-  {2} ''dataimport.properties'' file in folder ''solr.home/conf/'' with mandatory params inside (see bellow for the example of ''dataimport.properties'') <<BR>>
+  . {1} working DIH configuration in place <<BR>> {2} ''dataimport.properties'' file in folder ''solr.home/conf/'' with mandatory params inside (see below for the example of ''dataimport.properties'') <<BR>>
  
  {OK} Revisions:
  
@@ -1088, +1111 @@

    * parametrized the schedule interval (in minutes)
  
   * v1.1:
-   * now using ''Solr``Resource``Loader'' to get ''solr.home'' (as opposed to ''System properties'' in v1.0)
+   * now using ''SolrResourceLoader'' to get ''solr.home'' (as opposed to ''System properties'' in v1.0)
    * forces reloading of the properties file if the response code is not 200
    * logging done using ''slf4j'' (used ''System.out'' in v1.0)
  
@@ -1096, +1119 @@

    * initial release
  
  <<BR>>
+ 
  == SolrDataImportProperties ==
   * uses [[http://download.oracle.com/javase/6/docs/api/java/util/Properties.html|java.util.Properties]] to load settings from ''dataimport.properties''
  
@@ -1112, +1136 @@

  import org.slf4j.LoggerFactory;
  
  public class SolrDataImportProperties {
- 	private Properties properties;
+         private Properties properties;
- 	
+ 
- 	public static final String SYNC_ENABLED		= "syncEnabled";
+         public static final String SYNC_ENABLED         = "syncEnabled";
- 	public static final String SYNC_CORES		= "syncCores";
+         public static final String SYNC_CORES           = "syncCores";
- 	public static final String SERVER 		= "server";
+         public static final String SERVER               = "server";
- 	public static final String PORT 		= "port";
+         public static final String PORT                 = "port";
- 	public static final String WEBAPP 		= "webapp";
+         public static final String WEBAPP               = "webapp";
- 	public static final String PARAMS 		= "params";
+         public static final String PARAMS               = "params";
- 	public static final String INTERVAL		= "interval";
+         public static final String INTERVAL             = "interval";
- 	
+ 
- 	private static final Logger logger = LoggerFactory.getLogger(SolrDataImportProperties.class);
+         private static final Logger logger = LoggerFactory.getLogger(SolrDataImportProperties.class);
- 	
+ 
- 	public SolrDataImportProperties(){
+         public SolrDataImportProperties(){
- //		loadProperties(true);
- 	}
- 	
+ //              loadProperties(true);
+         }
+ 
- 	public void loadProperties(boolean force){
+         public void loadProperties(boolean force){
- 		try{
+                 try{
- 			SolrResourceLoader loader = new SolrResourceLoader(null);
+                         SolrResourceLoader loader = new SolrResourceLoader(null);
- 			logger.info("Instance dir = " + loader.getInstanceDir());
+                         logger.info("Instance dir = " + loader.getInstanceDir());
- 			
+ 
- 			String configDir = loader.getConfigDir();
+                         String configDir = loader.getConfigDir();
- 			configDir = SolrResourceLoader.normalizeDir(configDir);
+                         configDir = SolrResourceLoader.normalizeDir(configDir);
- 			if(force || properties == null){
- 				properties = new Properties();
- 							
+                         if(force || properties == null){
+                                 properties = new Properties();
+ 
- 				String dataImportPropertiesPath = configDir + "\\dataimport.properties";
+                                 String dataImportPropertiesPath = configDir + "\\dataimport.properties";
- 				
+ 
- 				FileInputStream fis = new FileInputStream(dataImportPropertiesPath);
+                                 FileInputStream fis = new FileInputStream(dataImportPropertiesPath);
- 				properties.load(fis);
- 			}
+                                 properties.load(fis);
+                         }
- 		}catch(FileNotFoundException fnfe){
+                 }catch(FileNotFoundException fnfe){
- 			logger.error("Error locating DataImportScheduler dataimport.properties file", fnfe);
+                         logger.error("Error locating DataImportScheduler dataimport.properties file", fnfe);
- 		}catch(IOException ioe){
+                 }catch(IOException ioe){
- 			logger.error("Error reading DataImportScheduler dataimport.properties file", ioe);
+                         logger.error("Error reading DataImportScheduler dataimport.properties file", ioe);
- 		}catch(Exception e){
+                 }catch(Exception e){
- 			logger.error("Error loading DataImportScheduler properties", e);
+                         logger.error("Error loading DataImportScheduler properties", e);
- 		}
- 	}
- 	
+                 }
+         }
+ 
- 	public String getProperty(String key){
+         public String getProperty(String key){
- 		return properties.getProperty(key);
+                 return properties.getProperty(key);
- 	}	
+         }
  }
  }}}
- 
  <<BR>>
+ 
  == ApplicationListener ==
   * the class implements [[http://download.oracle.com/javaee/6/api/javax/servlet/ServletContextListener.html|javax.servlet.ServletContextListener]] (listens to web app Initialize and Destroy events)
   * uses ''HTTPPostScheduler'', [[http://download.oracle.com/javase/6/docs/api/java/util/Timer.html|java.util.Timer]] and context attribute map to facilitate periodic method invocation (scheduling)
@@ -1180, +1204 @@

  
  public class ApplicationListener implements ServletContextListener {
  
- 	private static final Logger logger = LoggerFactory.getLogger(ApplicationListener.class);
+         private static final Logger logger = LoggerFactory.getLogger(ApplicationListener.class);
- 	
- 	@Override
+ 
+         @Override
- 	public void contextDestroyed(ServletContextEvent servletContextEvent) {
+         public void contextDestroyed(ServletContextEvent servletContextEvent) {
- 		ServletContext servletContext = servletContextEvent.getServletContext();
+                 ServletContext servletContext = servletContextEvent.getServletContext();
  
- 		// get our timer from the context
+                 // get our timer from the context
- 		Timer timer = (Timer)servletContext.getAttribute("timer");
+                 Timer timer = (Timer)servletContext.getAttribute("timer");
  
- 		// cancel all active tasks in the timers queue
+                 // cancel all active tasks in the timers queue
- 		if (timer != null)
- 			timer.cancel();
+                 if (timer != null)
+                         timer.cancel();
  
- 		// remove the timer from the context
+                 // remove the timer from the context
- 		servletContext.removeAttribute("timer");
+                 servletContext.removeAttribute("timer");
  
- 	}
+         }
  
- 	@Override
+         @Override
- 	public void contextInitialized(ServletContextEvent servletContextEvent) {
+         public void contextInitialized(ServletContextEvent servletContextEvent) {
- 		ServletContext servletContext = servletContextEvent.getServletContext();
+                 ServletContext servletContext = servletContextEvent.getServletContext();
- 		try{
+                 try{
- 			// create the timer and timer task objects
+                         // create the timer and timer task objects
- 			Timer timer = new Timer();
+                         Timer timer = new Timer();
- 			HTTPPostScheduler task = new HTTPPostScheduler(servletContext.getServletContextName(), timer);
+                         HTTPPostScheduler task = new HTTPPostScheduler(servletContext.getServletContextName(), timer);
- 			
+ 
- 			// get our interval from HTTPPostScheduler
+                         // get our interval from HTTPPostScheduler
- 			int interval = task.getIntervalInt();
- 			
+                         int interval = task.getIntervalInt();
+ 
- 			// get a calendar to set the start time (first run)
+                         // get a calendar to set the start time (first run)
- 			Calendar calendar = Calendar.getInstance();
+                         Calendar calendar = Calendar.getInstance();
- 			
+ 
- 			// set the first run to now + interval (to avoid fireing while the app/server is starting)
+                         // set the first run to now + interval (to avoid firing while the app/server is starting)
- 			calendar.add(Calendar.MINUTE, interval);
- 			Date startTime = calendar.getTime();
- 			
- 			// schedule the task
+                         calendar.add(Calendar.MINUTE, interval);
+                         Date startTime = calendar.getTime();
+ 
+                         // schedule the task
- 			timer.scheduleAtFixedRate(task, startTime, 1000 * 60 * interval);
+                         timer.scheduleAtFixedRate(task, startTime, 1000 * 60 * interval);
  
- 			// save the timer in context
+                         // save the timer in context
- 			servletContext.setAttribute("timer", timer);
+                         servletContext.setAttribute("timer", timer);
- 			
- 		} catch (Exception e) {
- 			if(e.getMessage().endsWith("disabled")){
- 				logger.info("Schedule disabled");
- 			}else{
+ 
+                 } catch (Exception e) {
+                         if(e.getMessage().endsWith("disabled")){
+                                 logger.info("Schedule disabled");
+                         }else{
- 				logger.error("Problem initializing the scheduled task: ", e);	
+                                 logger.error("Problem initializing the scheduled task: ", e);
- 			}			
- 		}
- 	}
+                         }
+                 }
+         }
  
  }
  }}}
- 
  <<BR>>
+ 
  == HTTPPostScheduler ==
   * the class extends [[http://download.oracle.com/javase/6/docs/api/java/util/TimerTask.html|java.util.TimerTask]], which implements [[http://download.oracle.com/javase/6/docs/api/java/lang/Runnable.html|java.lang.Runnable]]
   * represents main ''DIHScheduler'' thread (run by ''Timer'' background thread)
@@ -1243, +1267 @@

   * invokes URL using HTTP POST request
  
  <<BR>>
+ 
  {{{
  package org.apache.solr.handler.dataimport.scheduler;
  
@@ -1261, +1286 @@

  
  
  public class HTTPPostScheduler extends TimerTask {
- 	private String syncEnabled;
+         private String syncEnabled;
- 	private String[] syncCores;
+         private String[] syncCores;
- 	private String server;
+         private String server;
- 	private String port;
+         private String port;
- 	private String webapp;
+         private String webapp;
- 	private String params;
+         private String params;
- 	private String interval;
+         private String interval;
- 	private String cores;
+         private String cores;
- 	private SolrDataImportProperties p;
+         private SolrDataImportProperties p;
- 	private boolean singleCore;
+         private boolean singleCore;
- 	
+ 
- 	private static final Logger logger = LoggerFactory.getLogger(HTTPPostScheduler.class);
+         private static final Logger logger = LoggerFactory.getLogger(HTTPPostScheduler.class);
- 	
+ 
- 	public HTTPPostScheduler(String webAppName, Timer t) throws Exception{
+         public HTTPPostScheduler(String webAppName, Timer t) throws Exception{
- 		//load properties from global dataimport.properties
+                 //load properties from global dataimport.properties
- 		p = new SolrDataImportProperties();
+                 p = new SolrDataImportProperties();
- 		reloadParams();
- 		fixParams(webAppName);
- 		
+                 reloadParams();
+                 fixParams(webAppName);
+ 
- 		if(!syncEnabled.equals("1")) throw new Exception("Schedule disabled");
+                 if(!syncEnabled.equals("1")) throw new Exception("Schedule disabled");
- 		
+ 
- 		if(syncCores == null || (syncCores.length == 1 && syncCores[0].isEmpty())){
+                 if(syncCores == null || (syncCores.length == 1 && syncCores[0].isEmpty())){
- 			singleCore = true;
+                         singleCore = true;
- 			logger.info("<index update process> Single core identified in dataimport.properties");
+                         logger.info("<index update process> Single core identified in dataimport.properties");
- 		}else{
- 			singleCore = false;
+                 }else{
+                         singleCore = false;
- 			logger.info("<index update process> Multiple cores identified in dataimport.properties. Sync active for: " + cores);
+                         logger.info("<index update process> Multiple cores identified in dataimport.properties. Sync active for: " + cores);
- 		}
- 	}
- 	
+                 }
+         }
+ 
- 	private void reloadParams(){
+         private void reloadParams(){
- 		p.loadProperties(true);
+                 p.loadProperties(true);
- 		syncEnabled = p.getProperty(SolrDataImportProperties.SYNC_ENABLED);
+                 syncEnabled = p.getProperty(SolrDataImportProperties.SYNC_ENABLED);
- 		cores 		= p.getProperty(SolrDataImportProperties.SYNC_CORES);		
+                 cores           = p.getProperty(SolrDataImportProperties.SYNC_CORES);
- 		server 		= p.getProperty(SolrDataImportProperties.SERVER);
+                 server          = p.getProperty(SolrDataImportProperties.SERVER);
- 		port 		= p.getProperty(SolrDataImportProperties.PORT);
+                 port            = p.getProperty(SolrDataImportProperties.PORT);
- 		webapp 		= p.getProperty(SolrDataImportProperties.WEBAPP);
+                 webapp          = p.getProperty(SolrDataImportProperties.WEBAPP);
- 		params 		= p.getProperty(SolrDataImportProperties.PARAMS);
+                 params          = p.getProperty(SolrDataImportProperties.PARAMS);
- 		interval	= p.getProperty(SolrDataImportProperties.INTERVAL);
+                 interval        = p.getProperty(SolrDataImportProperties.INTERVAL);
- 		syncCores 	= cores != null ? cores.split(",") : null;
+                 syncCores       = cores != null ? cores.split(",") : null;
- 	}
- 	
+         }
+ 
- 	private void fixParams(String webAppName){
+         private void fixParams(String webAppName){
- 		if(server == null || server.isEmpty()) 	server = "localhost";
+                 if(server == null || server.isEmpty())  server = "localhost";
- 		if(port == null || port.isEmpty()) 		port = "8080";
+                 if(port == null || port.isEmpty())              port = "8080";
- 		if(webapp == null || webapp.isEmpty()) 	webapp = webAppName;
+                 if(webapp == null || webapp.isEmpty())  webapp = webAppName;
- 		if(interval == null || interval.isEmpty() || getIntervalInt() <= 0) interval = "30";
+                 if(interval == null || interval.isEmpty() || getIntervalInt() <= 0) interval = "30";
- 	}
- 	
+         }
+ 
- 	public void run() {
+         public void run() {
- 		try{
- 			// check mandatory params
+                 try{
+                         // check mandatory params
- 			if(server.isEmpty() || webapp.isEmpty() || params == null || params.isEmpty()){
+                         if(server.isEmpty() || webapp.isEmpty() || params == null || params.isEmpty()){
- 				logger.warn("<index update process> Insuficient info provided for data import");
+                                 logger.warn("<index update process> Insufficient info provided for data import");
- 				logger.info("<index update process> Reloading global dataimport.properties");
+                                 logger.info("<index update process> Reloading global dataimport.properties");
+                                 reloadParams();
- 				reloadParams();
- 			
- 			// single-core
- 			}else if(singleCore){
- 				prepUrlSendHttpPost();
  
- 			// multi-core
+                         // single-core
+                         }else if(singleCore){
+                                 prepUrlSendHttpPost();
+ 
+                         // multi-core
- 			}else if(syncCores.length == 0 || (syncCores.length == 1 && syncCores[0].isEmpty())){
+                         }else if(syncCores.length == 0 || (syncCores.length == 1 && syncCores[0].isEmpty())){
- 				logger.warn("<index update process> No cores scheduled for data import");
+                                 logger.warn("<index update process> No cores scheduled for data import");
- 				logger.info("<index update process> Reloading global dataimport.properties");
+                                 logger.info("<index update process> Reloading global dataimport.properties");
- 				reloadParams();
- 				
- 			}else{
- 				for(String core : syncCores){
- 					prepUrlSendHttpPost(core);
- 				}
- 			}
- 		}catch(Exception e){
+                                 reloadParams();
+ 
+                         }else{
+                                 for(String core : syncCores){
+                                         prepUrlSendHttpPost(core);
+                                 }
+                         }
+                 }catch(Exception e){
- 			logger.error("Failed to prepare for sendHttpPost", e);
+                         logger.error("Failed to prepare for sendHttpPost", e);
- 			reloadParams();
- 		}
- 	}
- 	
- 	
+                         reloadParams();
+                 }
+         }
+ 
+ 
- 	private void prepUrlSendHttpPost(){
+         private void prepUrlSendHttpPost(){
- 		String coreUrl = "http://" + server + ":" + port + "/" + webapp + params;
+                 String coreUrl = "http://" + server + ":" + port + "/" + webapp + params;
- 		sendHttpPost(coreUrl, null);
+                 sendHttpPost(coreUrl, null);
- 	}
- 	
+         }
+ 
- 	private void prepUrlSendHttpPost(String coreName){
+         private void prepUrlSendHttpPost(String coreName){
- 		String coreUrl = "http://" + server + ":" + port + "/" + webapp + "/" + coreName + params;
+                 String coreUrl = "http://" + server + ":" + port + "/" + webapp + "/" + coreName + params;
- 		sendHttpPost(coreUrl, coreName);	
+                 sendHttpPost(coreUrl, coreName);
- 	}
- 	
- 	
+         }
+ 
+ 
- 	private void sendHttpPost(String completeUrl, String coreName){
+         private void sendHttpPost(String completeUrl, String coreName){
- 		DateFormat df = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss SSS");
+                 DateFormat df = new SimpleDateFormat("dd.MM.yyyy HH:mm:ss SSS");
- 		Date startTime = new Date();
+                 Date startTime = new Date();
  
- 		// prepare the core var
+                 // prepare the core var
- 		String core = coreName == null ? "" : "[" + coreName + "] ";
+                 String core = coreName == null ? "" : "[" + coreName + "] ";
- 		
+ 
- 		logger.info(core + "<index update process> Process started at .............. " + df.format(startTime));
+                 logger.info(core + "<index update process> Process started at .............. " + df.format(startTime));
- 		
- 		try{
  
+                 try{
+ 
- 		    URL url = new URL(completeUrl);
+                     URL url = new URL(completeUrl);
- 		    HttpURLConnection conn = (HttpURLConnection)url.openConnection();	
+                     HttpURLConnection conn = (HttpURLConnection)url.openConnection();
- 		    
+ 
- 		    conn.setRequestMethod("POST");
+                     conn.setRequestMethod("POST");
- 		    conn.setRequestProperty("type", "submit");
+                     conn.setRequestProperty("type", "submit");
- 		    conn.setDoOutput(true);
+                     conn.setDoOutput(true);
- 		    
- 			// Send HTTP POST		    
- 		    conn.connect();
- 		    
+ 
+                         // Send HTTP POST
+                     conn.connect();
+ 
- 		    logger.info(core + "<index update process> Request method\t\t\t" + conn.getRequestMethod());
+                     logger.info(core + "<index update process> Request method\t\t\t" + conn.getRequestMethod());
- 		    logger.info(core + "<index update process> Succesfully connected to server\t" + server);		    
+                     logger.info(core + "<index update process> Successfully connected to server\t" + server);
- 		    logger.info(core + "<index update process> Using port\t\t\t" + port);
+                     logger.info(core + "<index update process> Using port\t\t\t" + port);
- 		    logger.info(core + "<index update process> Application name\t\t\t" + webapp);
+                     logger.info(core + "<index update process> Application name\t\t\t" + webapp);
- 		    logger.info(core + "<index update process> URL params\t\t\t" + params);
+                     logger.info(core + "<index update process> URL params\t\t\t" + params);
- 		    logger.info(core + "<index update process> Full URL\t\t\t\t" + conn.getURL());
+                     logger.info(core + "<index update process> Full URL\t\t\t\t" + conn.getURL());
- 		    logger.info(core + "<index update process> Response message\t\t\t" + conn.getResponseMessage());
+                     logger.info(core + "<index update process> Response message\t\t\t" + conn.getResponseMessage());
- 		    logger.info(core + "<index update process> Response code\t\t\t" + conn.getResponseCode());
+                     logger.info(core + "<index update process> Response code\t\t\t" + conn.getResponseCode());
- 		    
+ 
- 		    //listen for change in properties file if an error occurs 
+                     //listen for change in properties file if an error occurs
- 		    if(conn.getResponseCode() != 200){
+                     if(conn.getResponseCode() != 200){
- 		    	reloadParams();
- 		    }
- 		    
- 		    conn.disconnect();
+                         reloadParams();
+                     }
+ 
+                     conn.disconnect();
- 		    logger.info(core + "<index update process> Disconnected from server\t\t" + server);
+                     logger.info(core + "<index update process> Disconnected from server\t\t" + server);
- 		    Date endTime = new Date();
+                     Date endTime = new Date();
- 		    logger.info(core + "<index update process> Process ended at ................ " + df.format(endTime));
+                     logger.info(core + "<index update process> Process ended at ................ " + df.format(endTime));
- 		}catch(MalformedURLException mue){
+                 }catch(MalformedURLException mue){
- 			logger.error("Failed to assemble URL for HTTP POST", mue);
+                         logger.error("Failed to assemble URL for HTTP POST", mue);
- 		}catch(IOException ioe){
+                 }catch(IOException ioe){
- 			logger.error("Failed to connect to the specified URL while trying to send HTTP POST", ioe);
+                         logger.error("Failed to connect to the specified URL while trying to send HTTP POST", ioe);
- 		}catch(Exception e){
+                 }catch(Exception e){
- 			logger.error("Failed to send HTTP POST", e);
+                         logger.error("Failed to send HTTP POST", e);
- 		}
- 	}
+                 }
+         }
  
- 	public int getIntervalInt() {
+         public int getIntervalInt() {
- 		try{
- 			return Integer.parseInt(interval);	
+                 try{
+                         return Integer.parseInt(interval);
- 		}catch(NumberFormatException e){
+                 }catch(NumberFormatException e){
- 			logger.warn("Unable to convert 'interval' to number. Using default value (30) instead", e);
+                         logger.warn("Unable to convert 'interval' to number. Using default value (30) instead", e);
- 			return 30; //return default in case of error
+                         return 30; //return default in case of error
- 		}
- 	}	
+                 }
+         }
  }
  }}}
- 
  <<BR>>
+ 
  == dataimport.properties example ==
   * copy everything below ''dataimport scheduler properties'' to your ''dataimport.properties'' file and then change params
   * regardless of whether you have single or multiple-core Solr, use dataimport.properties located in your solr.home/conf (NOT solr.home/core/conf)
@@ -1422, +1447 @@

  
  
  #################################################
- #						#
+ #                                               #
- #	dataimport scheduler properties		#
+ #       dataimport scheduler properties         #
- #						#
+ #                                               #
  #################################################
  
  #  to sync or not to sync
@@ -1436, +1461 @@

  #  leave empty or comment it out if using single-core deployment
  syncCores=coreHr,coreEn
  
- #  solr server name or IP address 
+ #  solr server name or IP address
  #  [defaults to localhost if empty]
  server=localhost
  
@@ -1457, +1482 @@

  #  [defaults to 30 if empty]
  interval=10
  }}}
- 
  <<BR>>
+ 
  ------
  <<BR>>
  
  = Where to find it? =
  DataImportHandler is a new addition to Solr. You can either:
+ 
   * Download a nightly build of Solr from [[http://lucene.apache.org/solr/|Solr website]], or
   * Use the steps given in Full Import Example to try it out.
  
@@ -1475, +1501 @@

  
  = Troubleshooting =
   * If you are having trouble indexing international characters, try setting the '''encoding''' attribute to "UTF-8" on the dataSource element (example below). This should ensure that international character data (stored in UTF8) ingested by the given source will be preserved.
-    {{{
+   . {{{
     <dataSource type="FileDataSource" encoding="UTF-8"/>
-    }}}
+ }}}
  
  ----
  CategorySolrRequestHandler