Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2016/06/14 12:53:28 UTC

RE: Bypassing ExtractingRequestHandler

Oh, wow.  Yes, that's probably more than we'd want to support (unless any other Tika devs have an interest?)...very, very cool!


-----Original Message-----
From: Justin Lee [mailto:lee.justin.m@gmail.com] 
Sent: Monday, June 13, 2016 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Bypassing ExtractingRequestHandler

Thanks everyone for the help and advice.  The SolrJ example makes sense to me.  The import of SOLR-8166 was kind of mind boggling to me, but maybe I'll revisit after some time.

Tim: for context, I'm ultimately trying to create an external highlighter.
See https://issues.apache.org/jira/browse/SOLR-1397.  I want to store the bounding box (in PDF units) for each token in the extracted text stream.
Then when I get results from Solr using the above patch, I'll convert the
UTF-16 offsets into X/Y coordinates and perform highlighting as appropriate in the UI.  I like this approach because I get highlighting that accurately reflects the search, even when the search is complex (e.g. wildcards or proximity searches).
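
The lookup step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the per-token records are sorted by start offset and do not overlap, and the names TokenBox and boxes_for_match are hypothetical (they are not part of Tika, Solr, or the SOLR-1397 patch).

```python
from bisect import bisect_right
from typing import List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in PDF units


class TokenBox(NamedTuple):
    """Hypothetical record: one extracted token and where it sits on the page."""
    start: int  # UTF-16 offset of the token's first code unit (inclusive)
    end: int    # UTF-16 offset just past the token's last code unit (exclusive)
    box: Box


def boxes_for_match(tokens: List[TokenBox], start: int, end: int) -> List[Box]:
    """Return the boxes of every token overlapping the offset range [start, end).

    Assumes `tokens` is sorted by start offset and tokens do not overlap,
    so the list of end offsets is sorted too.
    """
    ends = [t.end for t in tokens]  # O(n) per call; cache this for large documents
    i = bisect_right(ends, start)   # first token that ends after `start`
    hits = []
    while i < len(tokens) and tokens[i].start < end:
        hits.append(tokens[i].box)
        i += 1
    return hits


tokens = [
    TokenBox(0, 5, (72.0, 700.0, 100.0, 712.0)),
    TokenBox(6, 11, (104.0, 700.0, 140.0, 712.0)),
    TokenBox(12, 15, (144.0, 700.0, 160.0, 712.0)),
]
# A match reported at offsets 6..11 maps to the second token's rectangle.
highlight = boxes_for_match(tokens, 6, 11)
```

A match that spans several tokens (a phrase or proximity hit, say) simply returns several rectangles, which the UI can draw individually or merge per line.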

I think it would take quite a bit of thinking to get something general enough to add into Tika.  For example, what units?  Take a look at the discussion of what units to report offsets in here:
https://issues.apache.org/jira/browse/SOLR-1954 (see the comments by Robert Muir; whatever issues there are here, they seem to be the same ones that affect the offsets reported by the Term Vector Component).  As another example, I'm just not sure what format is general enough to make sense for everybody.  I think I'll just create a mapping from UTF-16 offsets into (x1,y1) (x2,y2) pairs, dump it into a JSON blob, and store that in a NoSQL store.  Then, when I get Solr results, I'll look at the matching offsets, the JSON blob, and the original document and be on my merry way.  I'm happy to open a JIRA entry in Tika if you think this is a coherent request.
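
As a rough illustration of the units question and the JSON blob idea, here is a minimal sketch. It assumes the extracted text is just the tokens joined by single spaces, which real extractor output will not generally satisfy, and the function names and the JSON layout (start, end, box) are made up for this example. The point it shows is that offsets counted in UTF-16 code units differ from code-point indexes as soon as a character outside the Basic Multilingual Plane appears.

```python
import json


def utf16_len(s: str) -> int:
    """Length of `s` in UTF-16 code units (what Java's String.length() counts)."""
    return len(s.encode("utf-16-le")) // 2


def build_offset_map(tokens):
    """Build a list of {"start", "end", "box"} entries with UTF-16 offsets.

    `tokens` is an iterable of (text, (x1, y1, x2, y2)) in reading order.
    Assumption for this sketch: tokens are separated by exactly one space.
    """
    entries = []
    offset = 0
    for text, box in tokens:
        n = utf16_len(text)
        entries.append({"start": offset, "end": offset + n, "box": list(box)})
        offset += n + 1  # skip the assumed single-space separator
    return entries


# U+1D11E (musical G clef) takes two UTF-16 code units but is one code point.
entries = build_offset_map([
    ("a\U0001D11E", (72.0, 700.0, 90.0, 712.0)),
    ("b", (94.0, 700.0, 100.0, 712.0)),
])
blob = json.dumps(entries)  # this string is what would go into the external store
```

Whatever store holds the blob, the offsets in it have to be counted in the same units Solr reports, otherwise every token after the first supplementary character is shifted.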

The other approach, I suppose, is to try to pass the information along during indexing and store as a token payload.  But it seems like the indexing interface is really text oriented.  I have also thought about using DelimitedPayloadTokenFilter, which will increase the index size I imagine (how much, though?) and require more customization of Solr internals.  I don't know which is the better approach.
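
For comparison, the payload route could look something like the sketch below. DelimitedPayloadTokenFilter splits each token at a delimiter character ('|' by default) and stores what follows as the token's payload. Packing the four coordinates into a comma-separated string is purely an assumption of this example, as are the function names; it also assumes a whitespace tokenizer in front of the filter and an encoder that keeps the payload bytes as-is. The byte count is only the raw payload size before any index compression, so it is a crude upper-level estimate rather than a measurement.

```python
def _payload(box) -> str:
    # "%g" keeps 6 significant digits, which is about 3 decimals for
    # typical PDF page coordinates; widen the format if more is needed.
    return ",".join(f"{v:g}" for v in box)


def to_delimited(tokens, delimiter: str = "|") -> str:
    """Render (text, box) pairs as 'text|x1,y1,x2,y2' separated by spaces."""
    parts = []
    for text, box in tokens:
        if delimiter in text or any(c.isspace() for c in text):
            # Such tokens would need escaping or a different delimiter.
            raise ValueError(f"token cannot be encoded safely: {text!r}")
        parts.append(f"{text}{delimiter}{_payload(box)}")
    return " ".join(parts)


def raw_payload_bytes(tokens) -> int:
    """Total payload bytes before compression, for a rough index-size guess."""
    return sum(len(_payload(box).encode("utf-8")) for _, box in tokens)


sample = [
    ("hello", (72, 700, 100.5, 712)),
    ("world", (104, 700, 140, 712)),
]
field_value = to_delimited(sample)
extra_bytes = raw_payload_bytes(sample)
```

At roughly 15 to 20 bytes of payload per token in this textual form, a binary encoding of four floats (16 bytes) would be about the same size, so the choice between them is mostly about convenience.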

On Mon, Jun 13, 2016 at 7:22 AM Allison, Timothy B. <ta...@mitre.org>
wrote:

>
>
>
> > Two things: Here's a sample bit of SolrJ code, pulling out the DB stuff
> > should be straightforward:
> > http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> +1
>
> > We tend to prefer running Tika externally as it's entirely possible 
> > that Tika will crash or hang with certain files - and that will 
> > bring down Solr if you're running Tika within it.
>
> +1
>
> >> I want to make a small modification to Tika to get and save 
> >> additional data from my PDFs
> What info do you need, and if it is common enough, could you ask over 
> on Tika's JIRA and we'll try to add it directly?
>
>
>
>

Re: Bypassing ExtractingRequestHandler

Posted by Chris Mattmann <ch...@gmail.com>.
Hey Tim, sounds great to me.

—
Chris Mattmann
chris.mattmann@gmail.com






