Posted to dev@lucene.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/12/04 14:51:00 UTC

[jira] [Commented] (SOLR-7632) Change the ExtractingRequestHandler to use Tika-Server

    [ https://issues.apache.org/jira/browse/SOLR-7632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16276888#comment-16276888 ] 

Tim Allison commented on SOLR-7632:
-----------------------------------

bq. To carry out Erik Hatcher's recommendation... I don't know if we'd need CORS for this or not, but it might be neat to modify Tika's server to allow users to inject their own resources (endpoints) via a config file and an extra jar. Within the Solr project, we'd just have to implement a resource that takes an input stream, runs Tika and then adds a SolrInputDocument.

[~gostep] has proposed allowing users to configure a custom ContentHandler in tika-server.  This could enable Solr to create its own content handler that tika-server could use to send the extracted text to Solr on endDocument().
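
As a rough illustration of that idea (not the proposed tika-server change itself), such a handler might look like the sketch below. It uses only the JDK's SAX classes. The class name ForwardingContentHandler and the Consumer<String> sink are hypothetical stand-ins for whatever would actually build a SolrInputDocument and send it to Solr; how tika-server would load and configure the handler is left open here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffers the text events of one document and hands
// the accumulated text to a caller-supplied sink when the document ends.
public class ForwardingContentHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Assumption: in a real setup this sink would post the text to Solr.
    private final Consumer<String> sink;

    public ForwardingContentHandler(Consumer<String> sink) {
        this.sink = sink;
    }

    @Override
    public void startDocument() {
        // Reset the buffer so the handler can be reused across documents.
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        // The whole document has been parsed; forward the extracted text.
        sink.accept(text.toString());
    }

    public static void main(String[] args) {
        // Stand-in sink that just collects what it receives.
        List<String> received = new ArrayList<>();
        ForwardingContentHandler handler = new ForwardingContentHandler(received::add);

        // Simulate the SAX events a parser would emit for one document.
        handler.startDocument();
        char[] chunk = "hello from tika".toCharArray();
        handler.characters(chunk, 0, chunk.length);
        handler.endDocument();

        System.out.println(received); // prints [hello from tika]
    }
}
```

Forwarding only on endDocument() means a parse that fails partway through sends nothing to the sink, which may or may not be the behavior we would want for partially extracted documents.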

> Change the ExtractingRequestHandler to use Tika-Server
> ------------------------------------------------------
>
>                 Key: SOLR-7632
>                 URL: https://issues.apache.org/jira/browse/SOLR-7632
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - Solr Cell (Tika extraction)
>            Reporter: Chris A. Mattmann
>              Labels: gsoc2017, memex
>
> It's a pain to upgrade Tika's jars every time we release, and if Tika fails it messes up the ExtractingRequestHandler (e.g., the document type caused Tika to fail, etc.). A more reliable, better-separated, and easier-to-deploy version of the ExtractingRequestHandler would make a network call from the Solr server side to the Tika JAXRS server, get the results and then index the information that way. I have a patch in the works from the DARPA Memex project and I hope to post it soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)