Posted to solr-commits@lucene.apache.org by Apache Wiki <wi...@apache.org> on 2008/09/18 10:14:22 UTC
[Solr Wiki] Update of "DataImportHandler" by ShalinMangar
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The following page has been changed by ShalinMangar:
http://wiki.apache.org/solr/DataImportHandler
The comment on the change is:
Added wikipedia example
------------------------------------------------------------------------------
You can use this feature for indexing from REST APIs such as rss/atom feeds, XML data feeds, other Solr servers or even well-formed xhtml documents. Our XPath support has its limitations (no wildcards, only full paths etc.) but we have tried to make sure that common use-cases are covered, and since it's based on a streaming parser, it is extremely fast and consumes a constant amount of memory even for large XMLs. It does not support namespaces, but it can handle xmls with namespaces. When you provide the xpath, just drop the namespace and give the rest (eg if the tag is `'<dc:subject>'` the mapping should just contain `'subject'`). Easy, isn't it? And you didn't need to write one line of code! Enjoy :)
- /!\ Note : Unlike with database , it is not possible to omit the field declarations if you are using X!PathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the xml.
+ /!\ Note: Unlike with databases, it is not possible to omit the field declarations when using X!PathEntityProcessor. It relies on the xpaths declared in the fields to identify what to extract from the xml.
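
As an illustration of the namespace rule above, here is a minimal sketch of a feed entity (the feed URL, entity name and xpaths are hypothetical, not from the page being changed). The source tag is namespaced (`<dc:subject>`), but the mapping drops the prefix:

{{{
<dataConfig>
  <dataSource type="HttpDataSource" />
  <document>
    <!-- hypothetical Atom feed; URL and xpaths are illustrative only -->
    <entity name="entry" processor="XPathEntityProcessor"
            url="http://example.com/feed.atom" forEach="/feed/entry">
      <field column="title"   xpath="/feed/entry/title" />
      <!-- source tag is <dc:subject>, but the xpath omits the namespace prefix -->
      <field column="subject" xpath="/feed/entry/subject" />
    </entity>
  </document>
</dataConfig>
}}}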
+
+ [[Anchor(wikipedia)]]
+ == Example: Indexing wikipedia ==
+ The following data-config.xml was used to index a full (en-articles, recent only) [http://download.wikimedia.org/enwiki/20080724/ wikipedia dump]. The file downloaded from wikipedia was pages-articles.xml.bz2, which is around 18GB on disk when uncompressed.
+
+ {{{
+ <dataConfig>
+   <dataSource type="FileDataSource" encoding="UTF-8" />
+   <document>
+     <entity name="page"
+             processor="XPathEntityProcessor"
+             stream="true"
+             forEach="/mediawiki/page/"
+             url="/data/enwiki-20080724-pages-articles.xml">
+       <field column="id"        xpath="/mediawiki/page/id" />
+       <field column="title"     xpath="/mediawiki/page/title" />
+       <field column="revision"  xpath="/mediawiki/page/revision/id" />
+       <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
+       <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
+       <field column="text"      xpath="/mediawiki/page/revision/text" />
+       <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" />
+     </entity>
+   </document>
+ </dataConfig>
+ }}}
+ The relevant portion of schema.xml is below:
+ {{{
+ <field name="id" type="integer" indexed="true" stored="true" required="true"/>
+ <field name="title" type="string" indexed="true" stored="false"/>
+ <field name="revision" type="sint" indexed="true" stored="true"/>
+ <field name="user" type="string" indexed="true" stored="true"/>
+ <field name="userId" type="integer" indexed="true" stored="true"/>
+ <field name="text" type="text" indexed="true" stored="false"/>
+ <field name="timestamp" type="date" indexed="true" stored="true"/>
+ <field name="titleText" type="text" indexed="true" stored="true"/>
+ ...
+ <uniqueKey>id</uniqueKey>
+ <copyField source="title" dest="titleText"/>
+ }}}
+
+ Indexing the 7278241 articles took around 2 hours 40 minutes, with peak memory usage at around 4GB.
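
With the above data-config.xml and schema.xml in place, the import is kicked off through the handler's HTTP interface. The host, port and handler path shown here are the defaults from the Solr example setup; adjust them to your deployment:

{{{
http://localhost:8983/solr/dataimport?command=full-import
}}}

Progress can be checked by hitting the same handler with command=status.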
+
= Extending the tool with APIs =
The examples we explored are, admittedly, trivial. It is not possible to meet all user needs with an xml configuration alone, so we expose a few abstract classes which can be implemented by the user to enhance the functionality.
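As a sketch of what such an extension looks like, here is a hypothetical custom transformer that trims whitespace from a field before indexing. The class name and field name are illustrative; it deliberately avoids extending any Solr base class, relying instead on DIH picking up a `transformRow` method reflectively (it would be attached via the `transformer` attribute on an entity). Treat this as an assumption-laden example, not the definitive extension API:

{{{
import java.util.Map;

// Hypothetical transformer: trims the "title" field in each row.
// DIH is described as invoking transformRow reflectively, so no Solr
// imports are needed for this sketch.
public class TrimTransformer {
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get("title");
        if (value instanceof String) {
            row.put("title", ((String) value).trim());
        }
        return row; // returning the (modified) row keeps the document
    }
}
}}}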