Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/06/23 15:07:47 UTC

Indexing Rich Format Documents using Data Import Handler (DIH) and the TikaEntityProcessor

Please refer to this thread for history:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3C4C1B6BB6.7010001@gmail.com%3E


I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
Solr 1.4.0 and getting the following error:

java.lang.ClassNotFoundException: Unable to load BinURLDataSource or 
org.apache.solr.handler.dataimport.BinURLDataSource

curl -s http://test.html | curl \
  'http://localhost:9080/solr/update/extract?extractOnly=true' \
  --data-binary @- -H 'Content-type:text/html'

... works fine so presumably my Tika processor is working.


My data-config.xml looks like this:

<dataConfig>
   <dataSource type="JdbcDataSource"
     driver="oracle.jdbc.driver.OracleDriver"
     url="jdbc:oracle:thin:@whatever:12345:whatever"
     user="me"
     name="ds-db"
     password="secret"/>

   <dataSource type="BinURLDataSource"
     name="ds-url"/>

   <document>
     <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;=2">
       <field column="CONTENT_ID"                name="content_id"/>
       <field column="CMS_TITLE"                 name="cms_title"/>
       <field column="FORM_TITLE"                name="form_title"/>
       <field column="FILE_SIZE"                 name="file_size"/>
       <field column="KEYWORDS"                  name="keywords"/>
       <field column="DESCRIPTION"               name="description"/>
       <field column="CONTENT_URL"               name="content_url"/>
     </entity>

     <entity name="my_database_url"
      dataSource="ds-url"
      query="select CONTENT_URL from my_database where 
content_id='${my_database.CONTENT_ID}'">
      <entity processor="TikaEntityProcessor"
       dataSource="ds-url"
       format="text"
       url="http://www.mysite.com/${my_database.content_url}">
       <field column="text"/>
      </entity>
     </entity>

   </document>
</dataConfig>

I added the entity name="my_database_url" section to an existing
(working) database entity so that Tika can index the content
pointed to by content_url.
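
For comparison, here is a minimal sketch of how the Tika entity could be nested inside the database entity, so that each row's URL is handed to Tika. It is a sketch under assumptions, not a verified fix. It assumes that a child entity can resolve ${my_database.CONTENT_URL} from its parent row, that the schema has a "text" field, and that BinURLDataSource and TikaEntityProcessor are on the classpath (the ClassNotFoundException above suggests they are not in this install). The entity name "tika_content" is a placeholder.

```xml
<dataConfig>
  <!-- Same JDBC source as in the config above. -->
  <dataSource type="JdbcDataSource"
    name="ds-db"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    password="secret"/>

  <!-- Assumption: this class is available to the DIH at runtime. -->
  <dataSource type="BinURLDataSource" name="ds-url"/>

  <document>
    <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;= 2">
      <field column="CONTENT_ID"  name="content_id"/>
      <field column="CONTENT_URL" name="content_url"/>

      <!-- Child entity: evaluated once per database row, so the
           parent's CONTENT_URL column can be used in the url attribute.
           "tika_content" is a placeholder name. -->
      <entity name="tika_content"
        processor="TikaEntityProcessor"
        dataSource="ds-url"
        url="http://www.mysite.com/${my_database.CONTENT_URL}"
        format="text">
        <!-- Assumption: the schema defines a field named "text". -->
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

In this layout no second SQL query is needed, because the URL already comes back with the parent row.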

Is there anything obviously wrong with what I've tried so far? It is
not working: the import keeps rolling back with the error above.


Thanks - Tod