Posted to solr-user@lucene.apache.org by Justin Lee <le...@gmail.com> on 2016/06/10 01:20:07 UTC

Bypassing ExtractingRequestHandler

Has anybody had any experience bypassing ExtractingRequestHandler and
simply managing Tika manually?  I want to make a small modification to Tika
to get and save additional data from my PDFs, but I have been
procrastinating in no small part due to the unpleasant prospect of setting
up a development environment where I could compile and debug modifications
that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
occurs to me that it would be much easier if the two were separate, so I
could have direct control over Tika and just submit the text to Solr after
extraction.  Am I going to regret this approach?  I'm not sure what
ExtractingRequestHandler really does for me that Tika doesn't already do.

Also, I was reading this
<http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
stackoverflow entry and someone offhandedly mentioned that
ExtractingRequestHandler might be separated in the future anyway. Is there
a public roadmap for the project, or does one have to keep up with the
developer's mailing list and hunt through JIRA entries to keep up with the
pulse of the project?

Thanks,
Justin

Re: Bypassing ExtractingRequestHandler

Posted by Chris Mattmann <ch...@gmail.com>.
Hey Tim, sounds great to me...

—
Chris Mattmann
chris.mattmann@gmail.com

On 6/14/16, 8:53 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

>Oh, wow.  Y, that's probably more than we'd want to support (unless any other Tika devs have an interest?)...very, very cool!
>
>
>-----Original Message-----
>From: Justin Lee [mailto:lee.justin.m@gmail.com] 
>Sent: Monday, June 13, 2016 5:05 PM
>To: solr-user@lucene.apache.org
>Subject: Re: Bypassing ExtractingRequestHandler
>
>Thanks everyone for the help and advice.  The SolrJ example makes sense to me.  The import of SOLR-8166 was kind of mind-boggling to me, but maybe I'll revisit after some time.
>
>Tim: for context, I'm ultimately trying to create an external highlighter.
>See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the bounding box (in PDF units) for each token in the extracted text stream.
>Then when I get results from Solr using the above patch, I'll convert the
>UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in the UI.  I like this approach because I get highlighting that accurately reflects the search, even when the search is complex (e.g. wildcards or proximity searches).
>
>I think it would take quite a bit of thinking to get something general enough to add into Tika.  For example, what units?  Take a look at the discussion of what units to report offsets in here:
>https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert Muir -- although whatever issues there are here they are the same as the offsets reported in the Term Vector Component, it would seem to me).  As another example, I'm just not sure what format is general enough to make sense for everybody.  I think I'll just create a mapping from UTF-16 offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL store.  Then, when I get Solr results, I'll look at the matching offsets, the JSON blob, and the original document and be on my merry way.  I'm happy to open a JIRA entry in Tika if you think this is a coherent request.
>
>The other approach, I suppose, is to try to pass the information along during indexing and store as a token payload.  But it seems like the indexing interface is really text oriented.  I have also thought about using DelimitedPayloadTokenFilter, which will increase the index size I imagine (how much, though?) and require more customization of Solr internals.  I don't know which is the better approach.
>
>On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta...@mitre.org>
>wrote:
>
>>
>>
>>
>> >Two things: Here's a sample bit of SolrJ code, pulling out the DB 
>> >stuff
>> should be straightforward:
>> http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> +1
>>
>> > We tend to prefer running Tika externally as it's entirely possible 
>> > that Tika will crash or hang with certain files - and that will 
>> > bring down Solr if you're running Tika within it.
>>
>> +1
>>
>> >> I want to make a small modification to Tika to get and save 
>> >> additional data from my PDFs
>> What info do you need, and if it is common enough, could you ask over 
>> on Tika's JIRA and we'll try to add it directly?
>>
>>
>>
>>


RE: Bypassing ExtractingRequestHandler

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Oh, wow.  Y, that's probably more than we'd want to support (unless any other Tika devs have an interest?)...very, very cool!


-----Original Message-----
From: Justin Lee [mailto:lee.justin.m@gmail.com] 
Sent: Monday, June 13, 2016 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Bypassing ExtractingRequestHandler

Thanks everyone for the help and advice.  The SolrJ example makes sense to me.  The import of SOLR-8166 was kind of mind-boggling to me, but maybe I'll revisit after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in the UI.  I like this approach because I get highlighting that accurately reflects the search, even when the search is complex (e.g. wildcards or proximity searches).

I think it would take quite a bit of thinking to get something general enough to add into Tika.  For example, what units?  Take a look at the discussion of what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert Muir -- although whatever issues there are here they are the same as the offsets reported in the Term Vector Component, it would seem to me).  As another example, I'm just not sure what format is general enough to make sense for everybody.  I think I'll just create a mapping from UTF-16 offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL store.  Then, when I get Solr results, I'll look at the matching offsets, the JSON blob, and the original document and be on my merry way.  I'm happy to open a JIRA entry in Tika if you think this is a coherent request.

The other approach, I suppose, is to try to pass the information along during indexing and store as a token payload.  But it seems like the indexing interface is really text oriented.  I have also thought about using DelimitedPayloadTokenFilter, which will increase the index size I imagine (how much, though?) and require more customization of Solr internals.  I don't know which is the better approach.

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

>
>
>
> >Two things: Here's a sample bit of SolrJ code, pulling out the DB 
> >stuff
> should be straightforward:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible 
> > that Tika will crash or hang with certain files - and that will 
> > bring down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification to Tika to get and save 
> >> additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over 
> on Tika's JIRA and we'll try to add it directly?
>
>
>
>

Re: Bypassing ExtractingRequestHandler

Posted by Justin Lee <le...@gmail.com>.
Thanks everyone for the help and advice.  The SolrJ example makes sense to
me.  The import of SOLR-8166 was kind of mind-boggling to me, but maybe
I'll revisit after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the
bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate
in the UI.  I like this approach because I get highlighting that accurately
reflects the search, even when the search is complex (e.g. wildcards or
proximity searches).

I think it would take quite a bit of thinking to get something general
enough to add into Tika.  For example, what units?  Take a look at the
discussion of what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert
Muir -- although whatever issues there are here, they are the same as the
offsets reported in the Term Vector Component, it would seem to me).  As
another example, I'm just not sure what format is general enough to make
sense for everybody.  I think I'll just create a mapping from UTF-16
offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store
that in a NoSQL store.  Then, when I get Solr results, I'll look at the
matching offsets, the JSON blob, and the original document and be on my
merry way.  I'm happy to open a JIRA entry in Tika if you think this is a
coherent request.

The other approach, I suppose, is to try to pass the information along
during indexing and store as a token payload.  But it seems like the
indexing interface is really text oriented.  I have also thought about
using DelimitedPayloadTokenFilter, which will increase the index size I
imagine (how much, though?) and require more customization of Solr
internals.  I don't know which is the better approach.
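
As a rough sketch of the first approach, the offset-to-coordinate mapping
could be as simple as a list of (offset range, rectangle) pairs serialized
to a JSON blob. Everything below is illustrative only: the class and field
names are made up for this example, it is not an existing Tika or Solr API,
and it assumes some custom PDFBox/Tika hook has already produced a bounding
box per extracted text run.

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OffsetBoxIndex {

    /** One entry: a UTF-16 offset range in the extracted text and its box in PDF units. */
    public static class Span {
        public final int startOffset;   // inclusive, UTF-16 code units into the extracted text
        public final int endOffset;     // exclusive
        public final float x1, y1, x2, y2;  // bounding box corners in PDF user-space units

        public Span(int startOffset, int endOffset,
                    float x1, float y1, float x2, float y2) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    private final List<Span> spans = new ArrayList<>();

    public void add(Span span) {
        spans.add(span);
    }

    /** Serialize the whole mapping to a small JSON array, e.g. for a NoSQL store. */
    public String toJson() {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < spans.size(); i++) {
            Span s = spans.get(i);
            if (i > 0) sb.append(',');
            sb.append(String.format(Locale.ROOT,
                    "{\"start\":%d,\"end\":%d,\"box\":[%.2f,%.2f,%.2f,%.2f]}",
                    s.startOffset, s.endOffset, s.x1, s.y1, s.x2, s.y2));
        }
        return sb.append(']').toString();
    }

    /** Find the boxes overlapping a highlight hit reported as a [start, end) offset range. */
    public List<Span> spansFor(int start, int end) {
        List<Span> hits = new ArrayList<>();
        for (Span s : spans) {
            if (s.startOffset < end && s.endOffset > start) {
                hits.add(s);
            }
        }
        return hits;
    }
}

At query time, the start/end offsets coming back from SOLR-1397-style
highlighting would be fed to spansFor() to recover the rectangles to draw
in the UI.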

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

>
>
>
> >Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff
> should be straightforward:
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible
> > that Tika will crash or hang with certain files - and that will bring
> > down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification
> >> to Tika to get and save additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over on
> Tika's JIRA and we'll try to add it directly?
>
>
>
>

RE: Bypassing ExtractingRequestHandler

Posted by "Allison, Timothy B." <ta...@mitre.org>.


>Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff should be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/

+1

> We tend to prefer running Tika externally as it's entirely possible 
> that Tika will crash or hang with certain files - and that will bring 
> down Solr if you're running Tika within it.

+1

>> I want to make a small modification 
>> to Tika to get and save additional data from my PDFs
What info do you need, and if it is common enough, could you ask over on Tika's JIRA and we'll try to add it directly?




Re: Bypassing ExtractingRequestHandler

Posted by Erick Erickson <er...@gmail.com>.
Two things: Here's a sample bit of SolrJ code, pulling out
the DB stuff should be straightforward:
http://searchhub.org/2012/02/14/indexing-with-solrj/

It's a little out of date, but not by much. CloudSolrServer,
mentioned in one of the comments, has been deprecated in
favor of CloudSolrClient; similarly, StreamingUpdateSolrServer
is now ConcurrentUpdateSolrClient.
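
To make the division of labor concrete, here is a minimal sketch of that
pattern: Tika runs in the client JVM and the extracted text is pushed to
Solr with SolrJ. It is only an illustration; the Solr URL, core name, and
field names are placeholders, it uses a plain HttpSolrClient rather than
the CloudSolrClient/ConcurrentUpdateSolrClient mentioned above, and a real
indexer would add batching and error handling.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class ExternalTikaIndexer {

    public static void main(String[] args) throws Exception {
        Path pdf = Paths.get(args[0]);

        // Run Tika here, outside Solr, so a parser crash or hang cannot
        // take a Solr node down with it.
        AutoDetectParser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, metadata);
        }

        // Send the extracted text to Solr as an ordinary document.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", pdf.getFileName().toString());
            String title = metadata.get("title");
            if (title != null) {
                doc.addField("title", title);
            }
            doc.addField("text", handler.toString());
            solr.add(doc);
            solr.commit();
        }
    }
}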


Second, since Solr 5.4 there is the capability to add parser-specific
parameters through config; see SOLR-8166. I just added this to the
6.x Ref Guide today; it missed getting into the earlier Ref Guide
releases.

Best,
Erick

On Fri, Jun 10, 2016 at 1:22 AM, Charlie Hull <ch...@flax.co.uk> wrote:
> On 10/06/2016 02:20, Justin Lee wrote:
>>
>> Has anybody had any experience bypassing ExtractingRequestHandler and
>> simply managing Tika manually?  I want to make a small modification to
>> Tika
>> to get and save additional data from my PDFs, but I have been
>> procrastinating in no small part due to the unpleasant prospect of setting
>> up a development environment where I could compile and debug modifications
>> that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
>> occurs to me that it would be much easier if the two were separate, so I
>> could have direct control over Tika and just submit the text to Solr after
>> extraction.  Am I going to regret this approach?  I'm not sure what
>> ExtractingRequestHandler really does for me that Tika doesn't already do.
>
>
> We tend to prefer running Tika externally as it's entirely possible that
> Tika will crash or hang with certain files - and that will bring down Solr
> if you're running Tika within it. Here's a Dropwizard wrapper around Tika
> that might be of use:
> https://github.com/mattflax/dropwizard-tika-server
>
> Cheers
>
> Charlie
>
>>
>> Also, I was reading this
>>
>> <http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
>> stackoverflow entry and someone offhandedly mentioned that
>> ExtractingRequestHandler might be separated in the future anyway. Is there
>> a public roadmap for the project, or does one have to keep up with the
>> developer's mailing list and hunt through JIRA entries to keep up with the
>> pulse of the project?
>>
>> Thanks,
>> Justin
>>
>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk

Re: Bypassing ExtractingRequestHandler

Posted by Charlie Hull <ch...@flax.co.uk>.
On 10/06/2016 02:20, Justin Lee wrote:
> Has anybody had any experience bypassing ExtractingRequestHandler and
> simply managing Tika manually?  I want to make a small modification to Tika
> to get and save additional data from my PDFs, but I have been
> procrastinating in no small part due to the unpleasant prospect of setting
> up a development environment where I could compile and debug modifications
> that might run through PDFBox, Tika, and ExtractingRequestHandler.  It
> occurs to me that it would be much easier if the two were separate, so I
> could have direct control over Tika and just submit the text to Solr after
> extraction.  Am I going to regret this approach?  I'm not sure what
> ExtractingRequestHandler really does for me that Tika doesn't already do.

We tend to prefer running Tika externally as it's entirely possible that 
Tika will crash or hang with certain files - and that will bring down 
Solr if you're running Tika within it. Here's a Dropwizard wrapper 
around Tika that might be of use:
https://github.com/mattflax/dropwizard-tika-server

Cheers

Charlie
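
For what it's worth, the same external-Tika idea can also be sketched
against Apache Tika's stock tika-server (java -jar tika-server-x.y.jar,
listening on port 9998 by default) rather than the Dropwizard wrapper
linked above, whose API differs: PUT the raw file to /tika and read back
plain text. A rough sketch, under those assumptions:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TikaServerClient {

    /** PUT a file to a standalone tika-server and return the extracted plain text. */
    static String extract(Path file, String tikaBaseUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(tikaBaseUrl + "/tika").openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "text/plain");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        if (conn.getResponseCode() != 200) {
            throw new IOException("Tika server returned HTTP " + conn.getResponseCode());
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = conn.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumes a tika-server instance is already running on its default port.
        System.out.println(extract(Paths.get(args[0]), "http://localhost:9998"));
    }
}

If a document hangs the parser, only the external Tika process is affected,
and it can be killed or restarted without touching Solr.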
>
> Also, I was reading this
> <http://stackoverflow.com/questions/33292776/solr-tika-processor-not-crawling-my-pdf-files-prefectly>
> stackoverflow entry and someone offhandedly mentioned that
> ExtractingRequestHandler might be separated in the future anyway. Is there
> a public roadmap for the project, or does one have to keep up with the
> developer's mailing list and hunt through JIRA entries to keep up with the
> pulse of the project?
>
> Thanks,
> Justin
>


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk