Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/07/04 19:20:33 UTC

[Solr Wiki] Update of "DataImportHandler" by ShalinMangar

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Added details on writing custom Transformers

------------------------------------------------------------------------------
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to meet every user's needs through XML configuration alone, so we expose a few interfaces which can be implemented by the user to enhance the functionality.
  
+ [[Anchor(transformer)]]
  == Transformer ==
  Every row fetched from the DB can be consumed directly, massaged to create a totally new set of fields, or even expanded into more than one row of data. The configuration must be done at the entity level as follows.
  {{{
@@ -502, +503 @@

  }}}
  The rules for the template are the same as for the templates in 'query', 'url' etc. It helps to concatenate multiple values or add extra characters to the field before injection. It only applies to fields which have a 'template' attribute.
  ==== Attributes ====
-  * '''`template`''' : The template string. In the above example there are two placeholders '${e.name}' and '${eparent.surname}' .   Both the values must be present when it is being evaluated. Else it will not be evaluated. 
+  * '''`template`''' : The template string. In the above example there are two placeholders, '${e.name}' and '${eparent.surname}'. Both values must be present when the template is evaluated; otherwise the template is not evaluated at all. An example field definition is shown below.
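+ 
+ For instance, a field using this template might be declared as follows. This is an illustrative sketch: the column name `fullName` is assumed, while the placeholders are the ones from the example above.
+ {{{
+ <field column="fullName" template="${e.name} ${eparent.surname}" />
+ }}}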
+ 
+ [[Anchor(custom-transformers)]]
+ == Writing Custom Transformers ==
+ If you need any kind of custom processing before sending the row to Solr, you can write a transformer of your own. Let us take an example use-case. Suppose you have a single-valued field named "artistName" in your schema, of type="string", which you want to facet upon; therefore no index-time analysis should be done on this field. The value can contain multiple words, like "Celine Dion", but there is a problem: your data contains extra leading and trailing whitespace which you want to remove. The !WhitespaceAnalyzer in Solr can't be applied, since you don't want to tokenize the data into multiple tokens. A solution is to write a !TrimTransformer.
+ 
+ === A Simple TrimTransformer ===
+ {{{
+ import java.util.Map;
+ 
+ public class TrimTransformer {
+     public Object transformRow(Map<String, Object> row) {
+         // Values come out of the row map as Objects; this field holds a single String
+         String artistName = (String) row.get("artistName");
+         if (artistName != null)
+             row.put("artistName", artistName.trim());
+ 
+         return row;
+     }
+ }
+ }}}
+ There is no need to extend any interface. Just write any class with a method named `transformRow` matching the above signature; DataImportHandler will instantiate it and invoke `transformRow` through reflection. Specify it in your data-config.xml as follows:
+ {{{
+ <entity name="artist" query="..." transformer="TrimTransformer">
+ 	<field column="artistName" />
+ </entity>
+ }}}
+ 
+ === A General TrimTransformer ===
+ Suppose you want to write a general !TrimTransformer without hardcoding the column on which it operates. We then need a flag on the field in data-config.xml to indicate that the !TrimTransformer should apply itself to that field.
+ {{{
+ <entity name="artist" query="..." transformer="TrimTransformer">
+ 	<field column="artistName" trim="true" />
+ </entity>
+ }}}
+ Now you'll need to extend the [#transformer Transformer] abstract class and use the API methods in Context to get the list of fields in the entity, reading each field's attributes to detect whether the flag is set.
+ {{{
+ import java.util.List;
+ import java.util.Map;
+ 
+ import org.apache.solr.handler.dataimport.Context;
+ import org.apache.solr.handler.dataimport.Transformer;
+ 
+ public class TrimTransformer extends Transformer {
+ 
+     public Map<String, Object> transformRow(Map<String, Object> row, Context context) {
+         List<Map<String, String>> fields = context.getAllEntityFields();
+ 
+         for (Map<String, String> field : fields) {
+             // Check if this field has trim="true" specified in the data-config.xml
+             String trim = field.get("trim");
+             if (Boolean.parseBoolean(trim)) {
+                 // Get this field's value from the current row
+                 String columnName = field.get("column");
+                 String value = (String) row.get(columnName);
+                 // Trim and put the updated value back in the current row
+                 if (value != null)
+                     row.put(columnName, value.trim());
+             }
+         }
+ 
+         return row;
+     }
+ }
+ }}}
+ If the field is multi-valued, `row.get(columnName)` returns a List instead of a single object, and it would need to be handled appropriately, as in the sketch below. You'll need to add the DataImportHandler jar to your project as a dependency in order to compile against the Transformer and Context classes.
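+ 
+ Here is a minimal sketch of such handling, as a helper method you could add to the transformer above. The value types checked are assumptions based on the note above; fully-qualified names are used to avoid extra imports.
+ {{{
+ // Trims a value that may be a plain String or, for a
+ // multi-valued field, a List of values (assumed types).
+ private Object trim(Object value) {
+     if (value instanceof String)
+         return ((String) value).trim();
+     if (value instanceof java.util.List) {
+         java.util.List<String> trimmed = new java.util.ArrayList<String>();
+         for (Object v : (java.util.List<?>) value)
+             trimmed.add(v.toString().trim());
+         return trimmed;
+     }
+     return value; // leave other types untouched
+ }
+ }}}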
+ 
  [[Anchor(entityprocessor)]]
  == EntityProcessor ==
  Each entity is handled by a default entity processor called !SqlEntityProcessor. This works well for systems which use an RDBMS as a data source. For other kinds of data sources, such as REST or non-SQL data sources, you can implement the interface `org.apache.solr.handler.dataimport.EntityProcessor`. It is designed to stream rows one by one from an entity. The simplest way to implement your own !EntityProcessor is to extend !EntityProcessorBase and override the `public Map<String,Object> nextRow()` method.
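  
  Below is a minimal sketch of such a processor. It is illustrative only: the hardcoded row stands in for data streamed from a real source (REST, a file, etc.), and using `init(Context)` for setup is an assumption about where that work would go.
  {{{
  import java.util.Arrays;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.Map;
  
  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.EntityProcessorBase;
  
  public class MyEntityProcessor extends EntityProcessorBase {
  
      private Iterator<Map<String, Object>> rows;
  
      public void init(Context context) {
          super.init(context);
          // A real implementation would open its data source here;
          // this hardcoded row stands in for streamed data.
          Map<String, Object> row = new HashMap<String, Object>();
          row.put("id", "1");
          row.put("artistName", "Celine Dion");
          rows = Arrays.asList(row).iterator();
      }
  
      public Map<String, Object> nextRow() {
          // Returning null signals that this entity has no more rows.
          return rows.hasNext() ? rows.next() : null;
      }
  }
  }}}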