Posted to solr-dev@lucene.apache.org by "Noble Paul (JIRA)" <ji...@apache.org> on 2009/12/08 07:38:18 UTC

[jira] Issue Comment Edited: (SOLR-1358) Integration of Tika and DataImportHandler

    [ https://issues.apache.org/jira/browse/SOLR-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750855#action_12750855 ] 

Noble Paul edited comment on SOLR-1358 at 12/8/09 6:36 AM:
-----------------------------------------------------------

Let us provide a new TikaEntityProcessor:

{code:xml}
<dataConfig>
  <!-- use any DataSource of type DataSource<InputStream> -->
  <dataSource type="BinURLDataSource"/>
  <document>
    <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}">
    </entity>
  </document>
</dataConfig>
{code}

This will most likely need a BinURLDataSource or BinContentStreamDataSource, because Tika reads binary input.
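
For the use case in the issue description (one document built from database columns plus PDF content), the proposed processor could be nested under a database entity. This is only a rough sketch under assumptions: the data source names "db" and "bin", the JDBC driver and URL, and the table and column names are illustrative placeholders, and TikaEntityProcessor with its tikaConfig and url attributes is the proposal above, not an existing API.

{code:xml}
<dataConfig>
  <!-- "db" and "bin" are illustrative names; driver, url, table and columns are placeholders -->
  <dataSource name="db" type="JdbcDataSource" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:mem:example"/>
  <!-- binary source for Tika, i.e. any DataSource<InputStream> -->
  <dataSource name="bin" type="BinURLDataSource"/>
  <document>
    <!-- outer entity: one row per document, read from the database -->
    <entity name="doc" dataSource="db" query="select ID, TITLE, PDF_URL from DOCS">
      <!-- inner entity: the proposed processor fetches and parses the PDF named by the row -->
      <entity name="content" dataSource="bin" processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${doc.PDF_URL}">
      </entity>
    </entity>
  </document>
</dataConfig>
{code}

Keeping the two data sources separate lets the outer entity stay row-oriented while the inner one receives a stream. Which fields the inner entity would emit is left open here.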

My suggestion is that TikaEntityProcessor live in the extraction contrib, so that managing dependencies is easier. The cost is that extraction will then need a compile-time dependency on DIH.

Grant, what do you think?

      was (Author: noble.paul):
    Let us provide a new TikaEntityProcessor 

{code:xml}
<entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}">
</entity>
{code}

This will most likely need a BinURLDataSource or BinContentStreamDataSource, because Tika reads binary input.

My suggestion is that TikaEntityProcessor live in the extraction contrib, so that managing dependencies is easier. The cost is that extraction will then need a compile-time dependency on DIH.

Grant, what do you think?
  
> Integration of Tika and DataImportHandler
> -----------------------------------------
>
>                 Key: SOLR-1358
>                 URL: https://issues.apache.org/jira/browse/SOLR-1358
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Sascha Szott
>            Assignee: Noble Paul
>
> At the moment, it is impossible to configure Solr so that it builds documents from data that comes from both PDF documents and database table columns. Currently, to accomplish this, the user has to add a preprocessing step that converts PDF files into plain text files. Therefore, I would like to see an integration of Solr Cell into DIH that makes this preprocessing obsolete.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.