Posted to user@tika.apache.org by Raphaël <ra...@gmail.com> on 2012/05/27 21:57:10 UTC

Tika API and field postprocessing

Hi,

I use Tika through the Solr ExtractingRequestHandler and I face a very
common use case, namely postprocessing Tika fields in order to normalize
some field values or override them with explicitly passed
"literal" values.

With the exception of some vague statements about "ContentHandler", I
failed to find any good examples of this (even though it appears to be
quite an important feature).
I also would like to work at the API "field" level rather than working
with xpath on the raw Tika output.

Does anyone know of any good resources/samples about the proper way to
"postprocess" fields in the context of a Solr integration?

PS: I could have posted this on the Solr ML, but I know that while Tika
outputs XML it also overrides fields passed to the
ExtractingRequestHandler, so I guess that the changes I need to make would
rather apply somewhere around the Tika API.


thank you in advance

Re: Tika API and field postprocessing

Posted by Raphaël <ra...@gmail.com>.
On Sun, May 27, 2012 at 10:28:06PM +0100, Nick Burch wrote:
> On Sun, 27 May 2012, Raphaël wrote:
> > I use Tika through the Solr ExtractingRequestHandler and I face a very
> > common use case namely: postprocessing Tika fields in order to normalize
> > some field values or override them with explicitly passed
> > "literal" values.
> 
> I believe you'll need to ask on the SOLR list about this, as it's likely 
> to be specific to ExtractingRequestHandler which is maintained by SOLR 
> rather than Tika. Once metadata comes back from Tika you can do anything 
> you want with it; the question is more what SOLR's 
> ExtractingRequestHandler supports.


Then I'll try to get some advice there [1].

thank you


NB: Digging further into the API, the UpdateRequestProcessor [2] looks
like it may be a good candidate for a hook (but it's still a bit rough).


[1] http://apache.markmail.org/thread/idz73j4f7qi6fg6z
[2] https://lucene.apache.org/solr/api/org/apache/solr/update/processor/UpdateRequestProcessor.html


Re: Tika API and field postprocessing

Posted by Nick Burch <ni...@alfresco.com>.
On Sun, 27 May 2012, Raphaël wrote:
> I use Tika through the Solr ExtractingRequestHandler and I face a very
> common use case namely: postprocessing Tika fields in order to normalize
> some field values or override them with explicitly passed
> "literal" values.

I believe you'll need to ask on the SOLR list about this, as it's likely 
to be specific to ExtractingRequestHandler which is maintained by SOLR 
rather than Tika. Once metadata comes back from Tika you can do anything 
you want with it; the question is more what SOLR's 
ExtractingRequestHandler supports.

> I also would like to work at the API "field" level rather than working
> with xpath on the raw Tika output.

Fields are entirely SOLR/Lucene specific. Tika outputs metadata and 
content (as XHTML or plain text)

Nick
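
To make the use case discussed in this thread concrete, here is a minimal
sketch of the postprocessing step itself, written against the Java standard
library only. It does not use the Tika or Solr APIs: the class
FieldPostprocessor and its methods are hypothetical names, the extracted
metadata is modelled as a plain map of field name to values, and the rule
"an explicitly passed literal always replaces the extracted values" is an
assumption of this sketch rather than documented ExtractingRequestHandler
behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: not part of Tika or Solr.
final class FieldPostprocessor {
    private FieldPostprocessor() {}

    // Trim the value and collapse runs of whitespace into a single space.
    static String normalize(String value) {
        return value.trim().replaceAll("\\s+", " ");
    }

    // Normalize extracted values, then let literals override them.
    // Field names are lower-cased so "Title" and "title" end up in one field.
    static Map<String, List<String>> postprocess(
            Map<String, List<String>> extracted,
            Map<String, String> literals) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : extracted.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            List<String> values = fields.computeIfAbsent(name, k -> new ArrayList<>());
            for (String raw : entry.getValue()) {
                String clean = normalize(raw);
                // Skip blank values and duplicates created by normalization.
                if (!clean.isEmpty() && !values.contains(clean)) {
                    values.add(clean);
                }
            }
        }
        // Assumption of this sketch: a literal replaces whatever was extracted.
        for (Map.Entry<String, String> entry : literals.entrySet()) {
            String name = entry.getKey().toLowerCase(Locale.ROOT);
            fields.put(name, new ArrayList<>(Arrays.asList(entry.getValue())));
        }
        return fields;
    }
}
```

Calling FieldPostprocessor.postprocess with the extracted metadata and the
literal values returns a new map and leaves both inputs untouched. In a real
Solr integration the same logic would have to live wherever Solr exposes the
document before indexing, which is the question raised for the Solr list.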