Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2012/12/14 18:38:34 UTC
[Solr Wiki] Update of "DataImportHandler" by PeterTyrrell

The "DataImportHandler" page has been changed by PeterTyrrell:
http://wiki.apache.org/solr/DataImportHandler?action=diff&rev1=330&rev2=331

  <<Anchor(commands)>> The handler exposes its entire API as HTTP requests. The following operations are available:
  
   * '''full-import''' : Full Import operation can be started by hitting the URL `http://<host>:<port>/solr/dataimport?command=full-import`
- 
    * This operation is started in a new thread, and the ''status'' attribute in the response should then show ''busy''.
    * The operation may take some time depending on the size of the dataset.
-   * When the full-import command is executed, it stores the start time of the operation in a file located at ''conf/dataimport.properties''  ([[#Configuring The Property Writer|this file is configurable]])
+   * When the full-import command is executed, it stores the start time of the operation in a file located at ''conf/dataimport.properties''  ([[#Configuring_The_Property_Writer|this file is configurable]])
    * This stored timestamp is used when a delta-import operation is executed.
    * Queries to Solr are not blocked during full-imports.
    * It takes in extra parameters:
     * '''entity''' : Name of an entity directly under the <document> tag. Use this to execute one or more entities selectively. Multiple 'entity' parameters can be passed on to run multiple entities at once. If nothing is passed, all entities are executed.
     * '''clean''' : (default 'true'). Tells whether to clean up the index before the indexing is started.
     * '''commit''' : (default 'true'). Tells whether to commit after the operation.
-    * '''optimize''' : (default 'true' up to Solr 3.6, 'false' afterwards). Tells whether to optimize after the operation. Please note: this can be a very expensive operation and usually does not make sense for delta-imports. 
+    * '''optimize''' : (default 'true' up to Solr 3.6, 'false' afterwards). Tells whether to optimize after the operation. Please note: this can be a very expensive operation and usually does not make sense for delta-imports.
     * '''debug''' : (default 'false'). Runs in debug mode. It is used by the interactive development mode ([[#interactive|see here]]).
- 
      * Please note that in debug mode, documents are never committed automatically. If you want to run debug mode and commit the results too, add 'commit=true' as a request parameter.
   * '''delta-import''' : For incremental imports and change detection, run the command `http://<host>:<port>/solr/dataimport?command=delta-import`. It supports the same clean, commit, optimize and debug parameters as the full-import command.
   * '''status''' : To check the status of the current command, request the URL `http://<host>:<port>/solr/dataimport`. It returns detailed statistics on the number of documents created and deleted, queries run, rows fetched, status, etc.
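
The command URLs above can be assembled programmatically. The following is a minimal sketch, not part of DIH itself: the helper name `dih_url`, the host and port, and the entity names `item` and `feature` are illustrative assumptions. It only builds the URL and does not send the request.

```python
from urllib.parse import urlencode


def dih_url(host, port, command, **params):
    """Build a DataImportHandler request URL (hypothetical helper)."""
    query = [("command", command)]
    for name, value in params.items():
        # A parameter such as 'entity' may be repeated, so accept lists too.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            if isinstance(v, bool):
                # Solr expects lowercase 'true' / 'false'.
                v = "true" if v else "false"
            query.append((name, v))
    return "http://%s:%s/solr/dataimport?%s" % (host, port, urlencode(query))


# Run two (assumed) entities without cleaning the index first, then commit.
url = dih_url("localhost", 8983, "full-import",
              entity=["item", "feature"], clean=False, commit=True)
print(url)
```

Passing `clean=False` explicitly matters here because, per the list above, `clean` defaults to 'true' and would otherwise wipe the index before the import starts.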
@@ -339, +337 @@

   . {{{
   deltaQuery="SELECT MAX(did) FROM ${dataimporter.request.dataView}"
  }}}
- 
    . Changed to:
- 
   {{{
   deltaQuery="SELECT MAX(did) AS did FROM ${dataimporter.request.dataView}"
  }}}
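
The `${dataimporter.request.dataView}` placeholder in the queries above refers to a request parameter named `dataView` sent with the import command. The sketch below only imitates that substitution to show where the value comes from; the `resolve` function is a hypothetical stand-in, not DIH's actual variable resolver, and the view name `item_view` and the host and port are assumptions.

```python
from urllib.parse import urlencode

# Request parameters sent with the command; 'dataView' is the custom one.
params = {"command": "delta-import", "dataView": "item_view"}
url = "http://localhost:8983/solr/dataimport?" + urlencode(params)


def resolve(template, request_params):
    """Replace ${dataimporter.request.<name>} placeholders (illustration only)."""
    for name, value in request_params.items():
        template = template.replace("${dataimporter.request.%s}" % name, value)
    return template


query = resolve(
    "SELECT MAX(did) AS did FROM ${dataimporter.request.dataView}", params)
print(url)
print(query)
```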
  
  = Configuring The Property Writer =
- <!> [[Solr4.1]] 
- Add the tag 'propertyWriter' directly under the 'dataConfig' tag.  The property "last_index_time" is converted to text and stored in the properties file and is available for the next import as the variable '${dih.last_index_time}' . This tag gives control over how this properties file is written.
+ <!> [[Solr4.1]]  Add the tag 'propertyWriter' directly under the 'dataConfig' tag.  The property "last_index_time" is converted to text and stored in the properties file and is available for the next import as the variable '${dih.last_index_time}' . This tag gives control over how this properties file is written.
  
  {{{
  <propertyWriter dateFormat="yyyy-MM-dd HH:mm:ss" type="SimplePropertiesWriter" directory="data" filename="my_dih.properties" locale="en_US" />
  }}}
-  * This tag is optional, resulting in the default locale, directory and filename.  The 'type' will default to SimplePropertiesWriter for non-SolrCloud installations.  For SolrCloud, ZKPropertiesWriter is the default. 
+  * This tag is optional, resulting in the default locale, directory and filename.  The 'type' will default to SimplePropertiesWriter for non-SolrCloud installations.  For SolrCloud, ZKPropertiesWriter is the default.
   * 'type' - the implementation class.  This is required unless <propertyWriter /> is omitted entirely.
   * 'filename' - (SimplePropertiesWriter) The default is the name of the request handler followed by ".properties", for instance, dataimport.properties
   * 'directory' - (SimplePropertiesWriter) The default is "conf".
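
To inspect the stored timestamp outside Solr, the properties file can be read with a few lines of code. This is a rough sketch under two assumptions: the file is written in the usual Java properties style, in which colons in values are escaped as `\:`, and the `dateFormat` is the `yyyy-MM-dd HH:mm:ss` pattern from the example above. The function name and the sample content are illustrative.

```python
from datetime import datetime


def read_last_index_time(properties_text, key="last_index_time"):
    """Return the stored timestamp as a datetime, or None if the key is absent."""
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Skip blank lines and Java-properties comments.
        if not line or line.startswith(("#", "!")):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            # Undo the '\:' escaping assumed for Java properties files.
            unescaped = value.strip().replace("\\:", ":")
            return datetime.strptime(unescaped, "%Y-%m-%d %H:%M:%S")
    return None


# Assumed sample content of a file such as my_dih.properties.
sample = "#Fri Dec 14 18:38:34 UTC 2012\nlast_index_time=2012-12-14 18\\:38\\:34\n"
print(read_last_index_time(sample))
```

If a different `dateFormat` is configured on the tag, the `strptime` pattern would need to be changed to match.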
@@ -532, +527 @@

  A transformer can be used to alter the value of a field fetched from the datasource or to populate an undefined field. If the action of the transformer fails, say a regex fails to match, then an existing field will be unaltered and an undefined field will remain undefined. The chaining effect described above allows a column's value to be altered again and again by successive transformers. A transformer may make use of other entity fields in the course of massaging a column's value.
  
  === RegexTransformer ===
- There is a built-in transformer called '!RegexTransfromer' provided with DIH. It helps in extracting or manipulating values from fields (from the source) using regular expressions. The actual class name is `org.apache.solr.handler.dataimport.RegexTransformer`, but as it belongs to the default package, the package name can be omitted.
+ There is a built-in transformer called '!RegexTransformer' provided with DIH. It helps in extracting or manipulating values from fields (from the source) using regular expressions. The actual class name is `org.apache.solr.handler.dataimport.RegexTransformer`, but as it belongs to the default package, the package name can be omitted.
  
  '''Attributes'''
  
- !RegexTransfromer is only activated for fields with an attribute of 'regex' or 'splitBy'. Other fields are ignored.
+ !RegexTransformer is only activated for fields with an attribute of 'regex' or 'splitBy'. Other fields are ignored.
  
   * '''`regex`''' : The regular expression that is used to match against the column or sourceColName's value(s). If `replaceWith` is absent, each regex ''group'' is taken as a value and a list of values is returned.
   * '''`sourceColName`''' : The column on which the regex is to be applied. If this is absent, source and target are the same.
@@ -562, +557 @@

  In this example the attributes 'regex' and 'sourceColName' are custom attributes used by the transformer. It reads the field 'full_name' from the resultset and transforms it to two new target fields 'firstName' and 'lastName'. So even though the query returned only one column 'full_name' in the resultset the solr document gets two extra fields 'firstName' and 'lastName' which are 'derived' fields. These new fields are only created if the regexp matches.
  
  The 'emailids' field in the table can be a comma-separated value, so it may yield one or more email ids, and we expect 'mailId' to be a multivalued field in Solr.
+ 
+ The regular expression matching is case-sensitive by default. Use the (?i) and/or (?u) embedded flags (u enables Unicode case-folding, i is US-ASCII only) to indicate that all or a portion of the expression should be case-insensitive. Other flags and behaviours can be set according to Java's regex flavour, cf. `java.util.regex`.
+ 
+ {{{
+ <!-- matches Apples and apples -->
+ <field column="just_apples" regex="(?iu)(apples)" />
+ }}}
  
  <!> Note that this transformer can either be used to split a string into tokens based on a '''`splitBy`''' pattern, or to perform a string substitution as per '''`replaceWith`''', or it can assign groups within a pattern to a list of '''`groupNames`'''. It decides what to do based on the attributes '''`splitBy`''', '''`replaceWith`''' and '''`groupNames`''', which are looked for in that order. The first one found is acted upon and other unrelated attributes are ignored.
  
@@ -640, +642 @@

  {{{
  <field column="price" formatStyle="number" />
  }}}
+ By default, !NumberFormat uses the system's default locale to parse the given string.  Optionally, specify the Locale to use as shown (see java.util.Locale javadoc for more information):
- By default, !NumberFormat uses the system's default locale to parse the given string. 
- Optionally, specify the Locale to use as shown (see java.util.Locale javadoc for more information):
  
  {{{
  <field column="price" formatStyle="number" locale="de-DE" />
  }}}
- 
  '''Attributes'''
  
  !NumberFormatTransformer applies only to fields with a 'formatStyle' attribute.
  
   * '''`formatStyle`''' : The format used for parsing this field. The value of the attribute must be one of (number|percent|integer|currency). This uses the semantics of Java's [[http://java.sun.com/j2se/1.4.2/docs/api/java/text/NumberFormat.html|NumberFormat]].
   * '''`sourceColName`''' : The column on which the !NumberFormat is to be applied. If this is absent, source and target are the same.
-  * '''`locale`''' : The locale to be used for parsing the strings. If no Locale is specified, Solr4.1 and later defaults to the ROOT Locale (Versions prior to Solr4.1 use the current machine's default Locale.)  
+  * '''`locale`''' : The locale to be used for parsing the strings. If no Locale is specified, Solr4.1 and later defaults to the ROOT Locale (Versions prior to Solr4.1 use the current machine's default Locale.)
  
  === TemplateTransformer ===
  Can be used to overwrite or modify any existing Solr field or to create new Solr fields. The value assigned to the field is based on a static template string, which can contain DIH variables. If a template string contains placeholders or variables, they must be defined when the transformer is being evaluated. An undefined variable causes the entire template instruction to be ignored. For example:
@@ -863, +863 @@

  In the 'where' attribute, the lhs (the part before '=') is the column in y and the rhs (the part after '=') is the value to be computed for looking up the cache.
  
  An alternate syntax to Example 2 above uses the "cacheKey" and "cacheLookup" parameters:
+ 
  {{{
  <entity name="x" query="select * from x">
      <entity name="y" query="select * from y" processor="CachedSqlEntityProcessor" cacheKey="xid" cacheLookup="x.id">
@@ -1062, +1063 @@

   * On running `command=full-import`, the root entity (A) is executed first
   * Each row emitted by the 'query' in entity 'A' is fed into its sub-entities B and C
   * The queries in B and C use a column in 'A' to construct their queries using placeholders like `${A.a}`
- 
    * B has a url  (B is an xml/http datasource)
    * C has a query
   * C has two transformers ('f' and 'g' )
@@ -1087, +1087 @@

  While the namespace concept is useful, the user may want to put a computed value into the query or URL. For example, there may be a Date object while your datasource accepts dates only in some custom format.
  
  === formatDate ===
-  Use this to format dates as strings.  It takes three parameters (prior to Solr 4.1, it takes two):
+  . Use this to format dates as strings.  It takes three parameters (prior to Solr 4.1, it takes two):
    1. A variable that refers to a date, or a datemath expression.
-   2. A date format string.  See java.text.SimpleDateFormat javadoc for valid date formats. (Solr 4.1 and later, this must be enclosed in single quotes.  Solr 1.4 - 4.0, quotes are optional.  Prior to Solr 1.4, this must not be enclosed in single quotes)
+   1. A date format string.  See java.text.SimpleDateFormat javadoc for valid date formats. (Solr 4.1 and later, this must be enclosed in single quotes.  Solr 1.4 - 4.0, quotes are optional.  Prior to Solr 1.4, this must not be enclosed in single quotes)
-   3. <!> [[Solr4.1]] (optional)  The locale code to use when formatting dates, enclosed in single quotes. See java.util.Locale javadoc for details.  If omitted, this defaults to the ROOT Locale. (Note: prior to Solr 4.1, formatDate would always use the current machine's default locale.)
+   1. <!> [[Solr4.1]] (optional)  The locale code to use when formatting dates, enclosed in single quotes. See java.util.Locale javadoc for details.  If omitted, this defaults to the ROOT Locale. (Note: prior to Solr 4.1, formatDate would always use the current machine's default locale.)
- 
  
   * example using a variable:  `'${dataimporter.functions.formatDate(item.ID, 'yyyy-MM-dd HH:mm')}'`
   * example using a datemath expression:  `'${dataimporter.functions.formatDate('NOW-3DAYS', 'yyyy-MM-dd HH:mm')}'`
@@ -1116, +1115 @@

    </document>
  </dataConfig>
  }}}
- The implementation of !LowerCaseFunctionEvaluator 
+ The implementation of !LowerCaseFunctionEvaluator
  
  <!> [[Solr4.1]] this example depends on API modifications made in Solr 4.1
+ 
  {{{
    public class LowerCaseFunctionEvaluator extends Evaluator{
      public String evaluate(String expression, Context context) {
@@ -1136, +1136 @@

  <<Anchor(interactive)>>
  
  = Interactive Development Mode =
- 
  /!\ '''NOTE:''' The Interactive 'debug' mode only exists in Solr 3.x.  It has not yet been implemented in Solr 4.x (see [[https://issues.apache.org/jira/browse/SOLR-4151|SOLR-4151]])
  
  This is a new cool and powerful feature in the tool. It helps you build a dataconfig.xml with the UI. It can be accessed from http://host:port/solr/admin/dataimport.jsp . The features are
@@ -1266, +1265 @@

   * uses ''HTTPPostScheduler'', [[http://download.oracle.com/javase/6/docs/api/java/util/Timer.html|java.util.Timer]] and context attribute map to facilitate periodic method invocation (scheduling)
   * Timer is essentially a facility for threads to schedule tasks for future execution in a background thread.
   * Don't forget to add the following listener declaration to Solr's web.xml:<<BR>>
+ 
  {{{
   <listener>
     <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
   </listener>
  }}}
   * In order to make the Scheduler classes available to DIH, you need to place the downloaded jar file in your solr.war's web-inf\lib folder (you can either alter the war archive before deploying it, or place the jar file in the deployed, unpacked {{{lib}}} folder under your web server's (typically) {{{webapps}}} folder)
+ 
  {{{
  package org.apache.solr.handler.dataimport.scheduler;
  
@@ -1585, +1586 @@

  
  = Troubleshooting =
   * If you are having trouble indexing international characters, try setting the '''encoding''' attribute to "UTF-8" on the dataSource element (example below). This should ensure that international character data (stored in UTF8) ingested by the given source will be preserved.
- 
    . {{{
     <dataSource type="FileDataSource" encoding="UTF-8"/>
  }}}
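
Why the encoding attribute matters can be seen with a small experiment. The sketch below is only an illustration of the general problem, not of DIH internals: it assumes the data is stored as UTF-8 and that the reader would otherwise fall back to ISO-8859-1 (the real fallback depends on the platform default). The sample string is arbitrary.

```python
text = "Z\u00fcrich caf\u00e9"          # "Zürich café"

# Bytes as they would sit in a UTF-8 encoded source file.
stored = text.encode("utf-8")

# Reading those bytes with the wrong charset garbles every non-ASCII character.
misread = stored.decode("iso-8859-1")

# Reading them with the declared encoding preserves the text.
correct = stored.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

Each accented character occupies two bytes in UTF-8, so the misread string comes out two characters longer than the original, which is a common sign of this kind of mismatch.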