Posted to issues@lucene.apache.org by "Robert Muir (Jira)" <ji...@apache.org> on 2019/12/04 13:29:00 UTC

[jira] [Commented] (SOLR-7633) Change the ExtractingRequestHandler to use Tika-Server

    [ https://issues.apache.org/jira/browse/SOLR-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987865#comment-16987865 ] 

Robert Muir commented on SOLR-7633:
-----------------------------------

Trying to resurrect interest in this ancient issue.

Tika has its own server, so it seems the integration could be greatly simplified (either server-side or client-side) to just call Tika's server and then index the result. I have not looked at Tika's API there, but it is probably easy to mock its responses for tests, and TONS of third-party dependencies would go away.
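
To make the client-side variant concrete, here is a minimal sketch of the "call Tika's server, then index the result" flow. It is an illustration under stated assumptions, not the proposed patch: it assumes Tika Server is listening on its default port 9998 and returns plain text from a PUT to /tika, and that Solr has a core that accepts JSON documents on /update. The core name "mycore", the field name "content", and the helper names are all hypothetical.

```python
import json
import urllib.request

# Assumed endpoints (placeholders for illustration only):
# Tika Server on its default port, and a hypothetical Solr core "mycore".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def extract_text(data: bytes, tika_url: str = TIKA_URL) -> str:
    """PUT raw document bytes to Tika Server and return the extracted text."""
    request = urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask Tika for plain text output
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def build_solr_doc(doc_id: str, text: str) -> dict:
    """Wrap extracted text in a Solr document; "content" is a hypothetical field."""
    return {"id": doc_id, "content": text.strip()}


def index_doc(doc: dict, solr_update_url: str = SOLR_UPDATE_URL) -> int:
    """POST one document to Solr's JSON update endpoint; return the HTTP status."""
    body = json.dumps([doc]).encode("utf-8")
    request = urllib.request.Request(
        solr_update_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With both services running, usage would look roughly like index_doc(build_solr_doc("doc-1", extract_text(open("report.pdf", "rb").read()))). Because the only contact with Tika is a single HTTP call, a test can replace that call with a canned response, which is the kind of mocking suggested above.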

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7633
>                 URL: https://issues.apache.org/jira/browse/SOLR-7633
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>            Priority: Major
>              Labels: memex
>             Fix For: 5.0.1
>
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails it messes up the ExtractingRequestHandler (e.g., when a particular document type causes Tika to fail). A more reliable, better separated, and easier-to-deploy version of the ExtractingRequestHandler would make a network call from the Solr server side to the Tika JAX-RS server, get the results, and then index the information that way. I have a patch in the works from the DARPA Memex project and I hope to post it soon.



