Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/09/18 10:14:22 UTC

[Solr Wiki] Update of "DataImportHandler" by ShalinMangar

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler

The comment on the change is:
Added wikipedia example

------------------------------------------------------------------------------
  
  You can use this feature for indexing from REST APIs such as RSS/Atom feeds, XML data feeds, other Solr servers, or even well-formed XHTML documents. Our XPath support has its limitations (no wildcards, only full paths, etc.), but we have tried to make sure that common use-cases are covered, and since it is based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XML files. It does not support namespaces, but it can handle XML with namespaces: when you provide the xpath, just drop the namespace and give the rest (e.g. if the tag is `'<dc:subject>'` the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
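  For instance, a minimal entity for an RSS feed whose items carry a namespaced `'<dc:subject>'` tag could be declared as sketched below; the feed URL and field names are illustrative only, not part of the example above:
  
  {{{
  <dataConfig>
          <dataSource type="HttpDataSource" />
          <document>
          <entity name="item" processor="XPathEntityProcessor"
                  url="http://example.com/feed.rss" forEach="/rss/channel/item">
                  <field column="title" xpath="/rss/channel/item/title" />
                  <!-- the tag in the feed is <dc:subject>; the namespace prefix is dropped in the xpath -->
                  <field column="subject" xpath="/rss/channel/item/subject" />
          </entity>
          </document>
  </dataConfig>
  }}}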
  
- /!\ Note : Unlike with database , it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the xml. 
+ /!\ Note : Unlike with databases, it is not possible to omit the field declarations when using X!PathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the xml.
+ 
+ [[Anchor(wikipedia)]]
+ == Example: Indexing wikipedia ==
+ The following data-config.xml was used to index a full (en-articles, recent only) [http://download.wikimedia.org/enwiki/20080724/ wikipedia dump]. The file downloaded from wikipedia was the pages-articles.xml.bz2 which when uncompressed is around 18GB on disk.
+ 
+ {{{
+ <dataConfig>
+         <dataSource type="FileDataSource" encoding="UTF-8" />
+         <document>
+         <entity name="page" processor="XPathEntityProcessor" stream="true" forEach="/mediawiki/page/" url="/data/enwiki-20080724-pages-articles.xml">
+                 <field column="id" xpath="/mediawiki/page/id" />
+                 <field column="title" xpath="/mediawiki/page/title" />
+                 <field column="revision" xpath="/mediawiki/page/revision/id" />
+                 <field column="user" xpath="/mediawiki/page/revision/contributor/username" />
+                 <field column="userId" xpath="/mediawiki/page/revision/contributor/id" />
+                 <field column="text" xpath="/mediawiki/page/revision/text" />
+                 <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+         </entity>
+         </document>
+ </dataConfig>
+ }}}
+ The relevant portion of schema.xml is below:
+ {{{
+ <field name="id" type="integer" indexed="true" stored="true" required="true"/>
+ <field name="title" type="string" indexed="true" stored="false"/>
+ <field name="revision" type="sint" indexed="true" stored="true"/>
+ <field name="user" type="string" indexed="true" stored="true"/>
+ <field name="userId" type="integer" indexed="true" stored="true"/>
+ <field name="text" type="text" indexed="true" stored="false"/>
+ <field name="timestamp" type="date" indexed="true" stored="true"/>
+ <field name="titleText" type="text" indexed="true" stored="true"/>
+ ...
+ <uniqueKey>id</uniqueKey>
+ <copyField source="title" dest="titleText"/>
+ }}}
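+ 
+ Before running the import, the handler must be registered in solrconfig.xml. A sketch, assuming the handler is mounted at `/dataimport` and data-config.xml sits in the core's conf directory:
+ {{{
+ <requestHandler name="/dataimport"
+                 class="org.apache.solr.handler.dataimport.DataImportHandler">
+     <lst name="defaults">
+         <!-- path of the data-config file, relative to the conf directory -->
+         <str name="config">data-config.xml</str>
+     </lst>
+ </requestHandler>
+ }}}
+ The import is then started by hitting `http://localhost:8983/solr/dataimport?command=full-import`.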
+ 
+ Time taken was around 2 hours 40 minutes to index 7278241 articles with peak memory usage at around 4GB.
+  
  = Extending the tool with APIs =
  The examples we explored are, admittedly, trivial. It is not possible to meet every user's needs through an XML configuration alone, so we expose a few abstract classes which can be implemented by the user to enhance the functionality.