Posted to user@tika.apache.org by Andrea Asta <as...@gmail.com> on 2015/05/19 10:09:18 UTC

How to customize AutoDetectParser without changing the distribution of Tika

Hello,
I was wondering if I could customize the AutoDetectParser without changing
the Tika jar files.

I am following the Parser 5 min quick start but can't figure out where to
add my new Parser.

Is there any programmatic way to alter the AutoDetectParser (and the Tika
facade) behaviour?

I would also like to use BoilerPipe as the default HTML parser.

Thank you
Andrea

Re: How to customize AutoDetectParser without changing the distribution of Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 19 May 2015, Andrea Asta wrote:
> Can you please give me more details about the service file?

See https://tika.apache.org/1.8/parser_guide.html#List_the_new_parser

Nick

Re: How to customize AutoDetectParser without changing the distribution of Tika

Posted by Andrea Asta <as...@gmail.com>.
Can you please give me more details about the service file?

From the parser documentation, I understand that I need to create:
- A file custom-mimetypes.xml in a package called org.apache.tika.mime,
containing just my new mime type
- A file for the AutoDetectParser. Do I need to modify the one shipped with
the Tika parsers, or just create a new one (and if so, where)?

I'll also read about the CompositeParser discussion, thanks.

Thanks
Andrea



Re: How to customize AutoDetectParser without changing the distribution of Tika

Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 19 May 2015, Andrea Asta wrote:
> I was wondering if I could customize the AutoDetectParser without 
> changing the Tika jar files.

Just add your own parser to the classpath, along with a service file that
lists it.

> I am following the Parser 5 min quick start but can't figure out where to
> add my new Parser.

Anywhere on your classpath. It can be in a new jar or just a lone directory
on the classpath, whichever works better.
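
To make the "lone directory on the classpath" option concrete, here is a
minimal sketch that writes such a service file using only the JDK. It assumes
the standard Java service-file convention (a file under META-INF/services/
named after the interface, with one implementation class name per line);
com.example.HelloParser and the class ServiceFileSketch are made-up names for
illustration, not anything that ships with Tika.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class ServiceFileSketch {

    /** The parser interface; the service file is named after it. */
    public static final String PARSER_SERVICE = "org.apache.tika.parser.Parser";

    /**
     * Writes META-INF/services/<serviceInterface> under classpathRoot,
     * one implementation class name per line, and returns the file path.
     */
    public static Path writeServiceFile(Path classpathRoot,
                                        String serviceInterface,
                                        String... implementationClasses)
            throws IOException {
        Path servicesDir = classpathRoot.resolve("META-INF").resolve("services");
        Files.createDirectories(servicesDir);
        Path serviceFile = servicesDir.resolve(serviceInterface);
        List<String> lines = Arrays.asList(implementationClasses);
        return Files.write(serviceFile, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical classpath root; in practice this would be the
        // directory (or jar contents) that also holds the compiled parser.
        Path root = Files.createTempDirectory("my-parsers");
        // com.example.HelloParser is a placeholder for your own parser class.
        Path file = writeServiceFile(root, PARSER_SERVICE, "com.example.HelloParser");
        System.out.println("Wrote " + file);
    }
}
```

In a normal build you would not generate this file at runtime; you would
simply keep it next to your compiled classes (or inside your jar) at the same
relative path.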

> Is there any programmatic way to alter the AutoDetectParser (and the Tika
> facade) behaviour?

Currently, the default setup is that non-Tika parsers win over Tika ones
when two parsers handle the same mime type. Otherwise, you can supply a
Tika config XML file that overrides things and forces different parsers,
optionally while keeping everything else the same; see
http://wiki.apache.org/tika/CompositeParserDiscussion
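
As a rough sketch of what such a config file can look like, assuming the
parsers section of the Tika 1.x config format: the first entry keeps the
default set of parsers, and the second forces a specific parser for one mime
type. com.example.HelloParser and application/x-hello are placeholders, and
exactly how overlapping entries are resolved may differ between Tika
versions, so check it against the release you are using.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep everything Tika would normally load -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- Placeholder: your own parser, bound to a placeholder mime type -->
    <parser class="com.example.HelloParser">
      <mime>application/x-hello</mime>
    </parser>
  </parsers>
</properties>
```

The file can then be loaded into a TikaConfig and passed to the
AutoDetectParser (or the Tika facade) constructor that accepts one.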

Nick