Posted to solr-user@lucene.apache.org by Chris Harris <ry...@gmail.com> on 2009/02/05 00:45:28 UTC

Latest on DataImportHandler and Tika?

Back in November, Shalin and Grant were discussing integrating
DataImportHandler and Tika. Shalin's view of the best way to do this
was as follows:

**

I think the best way would be a TikaEntityProcessor which knows how to
handle documents. I guess a typical use-case would be
FileListEntityProcessor->TikaEntityProcessor as parent-child entities.

Also see SOLR-833 which adds a FieldReaderDataSource using which you can
pass any field's content to an entity for processing. So you can have a
[SqlEntityProcessor, JdbcDataSource] producing a blob and a
[FieldReaderDataSource, TikaEntityProcessor] consuming it.

(http://www.nabble.com/DataImportHandler-and-Blobs-td20464891.html)

**
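
For illustration, the parent-child arrangement Shalin describes might
look roughly like this in a DIH data-config.xml. This is only a sketch
under assumptions: TikaEntityProcessor is the proposed processor and
does not exist yet, so its url attribute and the "text" column are
guesses, and the /path/to/docs directory and the "body" field name are
placeholders.

```xml
<dataConfig>
  <document>
    <!-- Parent entity: walks a directory and emits one row per file.
         rootEntity="false" so each child row becomes the Solr document. -->
    <entity name="files"
            processor="FileListEntityProcessor"
            baseDir="/path/to/docs"
            fileName=".*\.(pdf|doc)"
            recursive="true"
            rootEntity="false"
            dataSource="null">

      <!-- Child entity: HYPOTHETICAL TikaEntityProcessor that would
           extract body text from the file found by the parent.
           The url attribute and the "text" column are assumptions. -->
      <entity name="tika"
              processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}">
        <field column="text" name="body"/>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The second use-case would have the same shape, with a
[SqlEntityProcessor, JdbcDataSource] entity as the parent and the Tika
entity reading the parent's column through a FieldReaderDataSource.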

Has there been any work on something like this? Alternatively, has
anyone else put together another way to get DataImportHandler
to extract body text from PDFs, Word files, etc.?

Thanks,
Chris

Re: Latest on DataImportHandler and Tika?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
We have not taken up anything yet. The idea is to create another
contrib that will contain the DIH extensions that have external
dependencies, as in SOLR-934.
TikaEntityProcessor is something we wish to do, but our limited
bandwidth has been the problem.




-- 
--Noble Paul