
Posted to solr-user@lucene.apache.org by Frederik Van Hoyweghen <fr...@chapoo.com> on 2018/02/08 11:47:46 UTC

Opinions on ExtractingRequestHandler

Hey everyone,

What are your experiences with using Solr's ExtractingRequestHandler in
production?

I've been reading mixed remarks about it, so I was wondering what your
actual experiences are.

Personally, I think the safer approach is to set up a separate service that
is solely responsible for parsing file contents with Tika (to be indexed by
Solr later in the process). That way we can use whatever Tika version we
want, along with anything else we might want to add.

Looking forward to your response!

Kind regards,
Frederik

Re: Opinions on ExtractingRequestHandler

Posted by "Sreenivas.T" <sr...@gmail.com>.

Frederik,

We also use a separate service, which runs Tika and then uses SolrJ to
index the content.
The main reason we went with this approach is the flexibility to
manipulate and transform the data beyond what Tika does.

As I understand it, if no other transformation is needed,
ExtractingRequestHandler should be fine in production too.

Regards,
Sreenivas
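
The approach described above (a separate service that runs Tika, transforms
the result, and then indexes it into Solr) could look roughly like the sketch
below. It is written in Python rather than SolrJ purely for brevity. It
assumes a stock tika-server on port 9998 and Solr's JSON update endpoint; the
collection name `documents`, the `content` field, and all function names are
hypothetical and not taken from this thread.

```python
import json
import urllib.request

# Assumptions: adjust both URLs to your own setup.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/documents/update?commit=true"


def extract_text(raw, tika_url=TIKA_URL, timeout=60.0):
    """Send raw file bytes to tika-server and return the extracted plain text."""
    req = urllib.request.Request(
        tika_url, data=raw, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def build_solr_doc(doc_id, text, extra=None):
    """Apply our own transformations before indexing.

    Here we only collapse whitespace and merge in extra fields; this is the
    place for whatever goes beyond what Tika itself produces.
    """
    doc = {"id": doc_id, "content": " ".join(text.split())}
    if extra:
        doc.update(extra)
    return doc


def index_docs(docs, solr_url=SOLR_UPDATE_URL, timeout=30.0):
    """POST a list of documents to Solr as a JSON array; return the HTTP status."""
    body = json.dumps(docs).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

A caller would chain the three steps, for example
`index_docs([build_solr_doc("report-1", extract_text(raw_bytes))])`. Keeping
the transformation in its own function means it can be changed and tested
without a running Tika or Solr instance.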

Re: Opinions on ExtractingRequestHandler

Posted by Charlie Hull <ch...@flax.co.uk>.

On 08/02/2018 11:47, Frederik Van Hoyweghen wrote:
> Personally, I feel like setting up a separate service which is solely
> responsible for parsing file contents (to be indexed by Solr later on in
> the process) using Tika is a safer approach, so we can use whatever Tika
> version we want along with other things we might want to add.

Yes, do this. It's entirely possible to bring down Tika with a nasty 
PDF, or end up consuming lots of resources in the extraction step and 
have these impact your Solr server. Run it separately and you can 
monitor it/kill it if necessary.

You might like my colleague Matt Pearce's Dropwizard wrapper for Tika:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie
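
The "monitor it/kill it" point above can be illustrated with a minimal
sketch: run each extraction in its own child process with a time limit, so a
pathological file cannot hang or exhaust the process that talks to Solr. The
function name and the example command are hypothetical; the command assumes
a tika-app jar on disk and is not something stated in this thread.

```python
import subprocess
import sys


def run_extraction(cmd, timeout_s):
    """Run an extraction command in a child process.

    Returns the command's stdout as text, or None if the command fails or
    exceeds the time limit. On timeout, subprocess.run kills the child
    before raising TimeoutExpired, so nothing is left running.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, timeout=timeout_s, check=True
        )
    except subprocess.TimeoutExpired:
        print(f"extraction killed after {timeout_s}s: {cmd}", file=sys.stderr)
        return None
    except subprocess.CalledProcessError as exc:
        print(f"extraction failed with exit code {exc.returncode}", file=sys.stderr)
        return None
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical usage, assuming tika-app.jar is in the working directory:
# text = run_extraction(["java", "-jar", "tika-app.jar", "--text", "nasty.pdf"], 60.0)
```

A `None` result lets the caller skip or quarantine the offending file and
carry on with the rest of the batch, while Solr itself is never involved in
the extraction step.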


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk