Posted to user@tika.apache.org by Andrea Asta <as...@gmail.com> on 2015/05/19 10:09:18 UTC
How to customize AutoDetectParser without changing the distribution of Tika
Hello,
I was wondering if I could customize the AutoDetectParser without changing
the Tika jar files.
I am following the Parser 5 min quick start but can't figure out where to
add my new Parser.
Is there any programmatic way to alter the AutoDetectParser (and the Tika
facade) behaviour?
I would also like to use BoilerPipe as the default HTML parser.
Thank you
Andrea
Re: How to customize AutoDetectParser without changing the distribution of Tika
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 19 May 2015, Andrea Asta wrote:
> Can you please give me more details about the service file?
See https://tika.apache.org/1.8/parser_guide.html#List_the_new_parser
Nick
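The linked section boils down to the standard Java service-provider convention. As a minimal sketch, assuming a hypothetical parser class named com.example.HelloParser (the name is an illustration, not something from this thread), the service file would be a plain text file named after the Tika Parser interface, placed under META-INF/services/ on the classpath:

```
# File: META-INF/services/org.apache.tika.parser.Parser
# One fully qualified class name per line.
# com.example.HelloParser is a hypothetical class used for illustration.
com.example.HelloParser
```

A file like this would live alongside your own classes (in your jar or classpath directory), so the service file inside the Tika jars stays untouched.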
Re: How to customize AutoDetectParser without changing the distribution of Tika
Posted by Andrea Asta <as...@gmail.com>.
Can you please give me more details about the service file?
From reading the parser documentation, it seems I need to create:
- A file custom-mimetypes.xml in a package called org.apache.tika.mime,
containing just my new mime type
- A service file for the AutoDetectParser. Do I need to modify the one that
ships with the Tika parsers, or just create a new one (and if so, where)?
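For the first item, a minimal sketch of such a file might look like the following. The mime type application/hello and the *.hi glob are placeholder values for illustration only; the file would sit at org/apache/tika/mime/custom-mimetypes.xml on the classpath.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- org/apache/tika/mime/custom-mimetypes.xml -->
<!-- The type and glob below are placeholders, not values from this thread. -->
<mime-info>
  <mime-type type="application/hello">
    <glob pattern="*.hi"/>
  </mime-type>
</mime-info>
```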
I'll also read about the CompositeParser discussion, thanks.
Thanks
Andrea
2015-05-19 11:08 GMT+02:00 Nick Burch <ap...@gagravarr.org>:
> On Tue, 19 May 2015, Andrea Asta wrote:
>
>> I was wondering if I could customize the AutoDetectParser without
>> changing the Tika jar files.
>
> Just add your own parser to the classpath, along with a service file.
>
>> I am following the Parser 5 min quick start but can't figure out where to
>> add my new Parser.
>
> Anywhere on your classpath. It can be in a new jar, or just a lone directory
> on the classpath, whichever works better.
>
>> Is there any programmatic way to alter the AutoDetectParser (and the Tika
>> facade) behaviour?
>
> Currently, the default setup is that non-Tika parsers win over Tika ones
> when two parsers handle the same mime type. Otherwise, you can supply a
> Tika config XML file that overrides things and forces different parsers,
> optionally while keeping everything else the same. See
> http://wiki.apache.org/tika/CompositeParserDiscussion
>
> Nick
Re: How to customize AutoDetectParser without changing the distribution of Tika
Posted by Nick Burch <ap...@gagravarr.org>.
On Tue, 19 May 2015, Andrea Asta wrote:
> I was wondering if I could customize the AutoDetectParser without
> changing the Tika jar files.
Just add your own parser to the classpath, along with a service file.
> I am following the Parser 5 min quick start but can't figure out where to
> add my new Parser.
Anywhere on your classpath. It can be in a new jar, or just a lone directory
on the classpath, whichever works better.
> Is there any programmatic way to alter the AutoDetectParser (and the Tika
> facade) behaviour?
Currently, the default setup is that non-Tika parsers win over Tika ones
when two parsers handle the same mime type. Otherwise, you can supply a
Tika config XML file that overrides things and forces different parsers,
optionally while keeping everything else the same. See
http://wiki.apache.org/tika/CompositeParserDiscussion
Nick
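As an illustration of the config-file route described above, here is a hedged sketch of a Tika config XML that keeps the default parsers for everything except text/html and routes that type to a custom parser. com.example.MyHtmlParser is a hypothetical class name, and the mime-exclude element follows the Tika configuration documentation for later 1.x releases, so check that your Tika version supports it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers for everything except HTML -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>text/html</mime-exclude>
    </parser>
    <!-- Hypothetical custom parser, used only for HTML -->
    <parser class="com.example.MyHtmlParser">
      <mime>text/html</mime>
    </parser>
  </parsers>
</properties>
```

A file like this can be loaded programmatically by passing it to a TikaConfig constructor and handing the resulting TikaConfig to AutoDetectParser or to the Tika facade, both of which have constructors that accept one.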