Posted to solr-user@lucene.apache.org by Tod <li...@gmail.com> on 2010/07/06 14:03:03 UTC

Re: Data Import Handler Rich Format Documents

On 6/28/2010 8:28 AM, Alexey Serba wrote:
>> Ok, I'm trying to integrate the TikaEntityProcessor as suggested. I'm using
>> Solr Version: 1.4.0 and getting the following error:
>>
>> java.lang.ClassNotFoundException: Unable to load BinURLDataSource or
>> org.apache.solr.handler.dataimport.BinURLDataSource
> It seems that DIH-Tika integration is not a part of Solr 1.4.0/1.4.1
> release. You should use trunk / nightly builds.
> https://issues.apache.org/jira/browse/SOLR-1583
> 

Thanks, that would explain things - I'm using a stock 1.4.0 download.


>> My data-config.xml looks like this:
>>
>> <dataConfig>
>>  <dataSource type="JdbcDataSource"
>>    driver="oracle.jdbc.driver.OracleDriver"
>>    url="jdbc:oracle:thin:@whatever:12345:whatever"
>>    user="me"
>>    name="ds-db"
>>    password="secret"/>
>>
>>  <dataSource type="BinURLDataSource"
>>    name="ds-url"/>
>>
>>  <document>
>>    <entity name="my_database"
>>     dataSource="ds-db"
>>     query="select * from my_database where rownum &lt;=2">
>>      <field column="CONTENT_ID"                name="content_id"/>
>>      <field column="CMS_TITLE"                 name="cms_title"/>
>>      <field column="FORM_TITLE"                name="form_title"/>
>>      <field column="FILE_SIZE"                 name="file_size"/>
>>      <field column="KEYWORDS"                  name="keywords"/>
>>      <field column="DESCRIPTION"               name="description"/>
>>      <field column="CONTENT_URL"               name="content_url"/>
>>    </entity>
>>
>>    <entity name="my_database_url"
>>     dataSource="ds-url"
>>     query="select CONTENT_URL from my_database where
>> content_id='${my_database.CONTENT_ID}'">
>>     <entity processor="TikaEntityProcessor"
>>      dataSource="ds-url"
>>      format="text">
>>      url="http://www.mysite.com/${my_database.content_url}"
>>      <field column="text"/>
>>     </entity>
>>    </entity>
>>
>>  </document>
>> </dataConfig>
>>
>> I added the entity name="my_database_url" section to an existing (working)
>> database entity to be able to have Tika index the content pointed to by the
>> content_url.
>>
>> Is there anything obviously wrong with what I've tried so far?
> 
> I think you should move Tika entity into my_database entity and
> simplify the whole configuration
> 
> <entity name="my_database" dataSource="ds-db" query="select * from
> my_database where rownum &lt;=2">
>     ...
>     <field column="CONTENT_URL"               name="content_url"/>
> 
>     <entity processor="TikaEntityProcessor" dataSource="ds-url"
> format="text" url="http://www.mysite.com/${my_database.content_url}">
>         <field column="text"/>
>     </entity>
> </entity>
> 

I assume this would only work after I check out and build from trunk?
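
To check that I've understood the suggestion, here is a rough sketch of how the merged data-config.xml might look. It assumes a trunk/nightly build that includes SOLR-1583, since stock 1.4.0 has neither BinURLDataSource nor TikaEntityProcessor. The entity name "tika_content" is a placeholder of mine and is not from the thread; everything else is copied from the config and the suggestion quoted above. I have not run this.

```xml
<dataConfig>
  <!-- Database source, unchanged from the working config -->
  <dataSource type="JdbcDataSource"
    driver="oracle.jdbc.driver.OracleDriver"
    url="jdbc:oracle:thin:@whatever:12345:whatever"
    user="me"
    name="ds-db"
    password="secret"/>

  <!-- Binary URL source for Tika; assumes a build that includes SOLR-1583 -->
  <dataSource type="BinURLDataSource"
    name="ds-url"/>

  <document>
    <entity name="my_database"
      dataSource="ds-db"
      query="select * from my_database where rownum &lt;=2">
      <field column="CONTENT_ID"  name="content_id"/>
      <field column="CMS_TITLE"   name="cms_title"/>
      <field column="FORM_TITLE"  name="form_title"/>
      <field column="FILE_SIZE"   name="file_size"/>
      <field column="KEYWORDS"    name="keywords"/>
      <field column="DESCRIPTION" name="description"/>
      <field column="CONTENT_URL" name="content_url"/>

      <!-- Tika entity nested inside the database entity, so it runs once
           per row. The name "tika_content" is a placeholder (assumption).
           The variable is written as in the suggestion; it may need to be
           CONTENT_URL if column lookup turns out to be case-sensitive
           (unverified). -->
      <entity name="tika_content"
        processor="TikaEntityProcessor"
        dataSource="ds-url"
        format="text"
        url="http://www.mysite.com/${my_database.content_url}">
        <field column="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

Compared with my original attempt, the separate my_database_url entity is gone, and the url attribute now sits inside the opening entity tag rather than after its closing bracket.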


Thanks - Tod