Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/06/23 15:07:47 UTC
Indexing Rich Format Documents using Data Import Handler (DIH) and
the TikaEntityProcessor
Please refer to this thread for history:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3C4C1B6BB6.7010001@gmail.com%3E
I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
Solr Version: 1.4.0 and getting the following error:
java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
org.apache.solr.handler.dataimport.BinURLDataSource
curl -s http://test.html | \
  curl 'http://localhost:9080/solr/update/extract?extractOnly=true' \
       --data-binary @- -H 'Content-type:text/html'
... works fine so presumably my Tika processor is working.
My data-config.xml looks like this:
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@whatever:12345:whatever"
              user="me"
              name="ds-db"
              password="secret"/>
  <dataSource type="BinURLDataSource"
              name="ds-url"/>
  <document>
    <entity name="my_database"
            dataSource="ds-db"
            query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID" name="content_id"/>
      <field column="CMS_TITLE" name="cms_title"/>
      <field column="FORM_TITLE" name="form_title"/>
      <field column="FILE_SIZE" name="file_size"/>
      <field column="KEYWORDS" name="keywords"/>
      <field column="DESCRIPTION" name="description"/>
      <field column="CONTENT_URL" name="content_url"/>
    </entity>
    <entity name="my_database_url"
            dataSource="ds-url"
            query="select CONTENT_URL from my_database where content_id='${my_database.CONTENT_ID}'">
      <entity processor="TikaEntityProcessor"
              dataSource="ds-url"
              format="text"
              url="http://www.mysite.com/${my_database.content_url}">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
I added the entity name="my_database_url" section to an existing
(working) database entity to be able to have Tika index the content
pointed to by the content_url.
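
For comparison, here is a minimal sketch of the same idea with the Tika entity nested inside the database entity, so that it runs once per row and can resolve the parent's columns. This is an illustration under assumptions, not a tested configuration. The child entity name tika_content is hypothetical, the uppercase CONTENT_URL in the variable assumes the row keys match the column names used in the field mappings, and the name="text" target assumes the schema has a field called text. It also assumes a Solr build whose DIH jars include BinURLDataSource and TikaEntityProcessor, which may not be the case for 1.4.0.

```xml
<document>
  <!-- Parent entity: one row per document, read from the database -->
  <entity name="my_database"
          dataSource="ds-db"
          query="select * from my_database where rownum &lt;= 2">
    <field column="CONTENT_ID" name="content_id"/>
    <field column="CONTENT_URL" name="content_url"/>
    <!-- Child entity (hypothetical name): fetches the URL built from the
         parent row and passes the binary stream to Tika -->
    <entity name="tika_content"
            processor="TikaEntityProcessor"
            dataSource="ds-url"
            url="http://www.mysite.com/${my_database.CONTENT_URL}"
            format="text">
      <!-- "text" is the column Tika emits; the target field name is an assumption -->
      <field column="text" name="text"/>
    </entity>
  </entity>
</document>
```

Nesting matters here because a ${my_database.…} variable is only resolved for entities that are children of my_database. A sibling entity has no current parent row to read from.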
Is there anything obviously wrong with what I've tried so far? It is
not working: the import keeps rolling back with the error above.
Thanks - Tod