Posted to user@tika.apache.org by Daniel Knapp <da...@mni.fh-giessen.de> on 2009/12/04 15:57:50 UTC

parsing only specified content types in archive

Hello,

Is there an option to define which content types should be parsed inside an archive file?
For example, I have a zip archive that contains jar and PDF files; Tika should parse only the PDF files and skip the rest.

Or is there a general option to define which content types should be parsed when using the Tika.parse(...) facade?

Thanks in advance!

Regards,
Daniel


Re: parsing only specified content types in archive

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Dec 4, 2009 at 3:57 PM, Daniel Knapp
<da...@mni.fh-giessen.de> wrote:
> Is there an option to define which content types should be parsed inside an archive file?
> For example, I have a zip archive that contains jar and PDF files; Tika should parse only
> the PDF files and skip the rest.

If you use the Parser interface directly you can pass in a custom
CompositeParser instance in the ParseContext to explicitly control how
component documents within an archive get parsed. Something like this
should do the trick:

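    // Needs java.util.Collections, org.apache.tika.parser.* and
    // org.apache.tika.parser.pdf.PDFParser on the import list.

    // A composite parser that only knows how to handle PDF documents
    CompositeParser composite = new CompositeParser();
    composite.setParsers(Collections.singletonMap(
        "application/pdf", (Parser) new PDFParser()));

    // Register it in the context as the parser for component documents
    ParseContext context = new ParseContext();
    context.set(Parser.class, composite);

    // The top-level archive is still handled by normal type detection
    new AutoDetectParser().parse(..., context);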
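
To make the dispatch idea concrete without tying it to a particular Tika version, here is a minimal plain-Java sketch of the same pattern: a map from content type to handler, where entries with no registered handler are simply skipped. This is not Tika code and not how Tika is implemented. ArchiveDispatchSketch, guessType, parseArchive and sampleArchive are hypothetical names made up for illustration, and guessing the type from the file extension is only a stand-in for real content type detection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ArchiveDispatchSketch {

    // Hypothetical stand-in for type detection: looks at the file extension only.
    public static String guessType(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) return "application/pdf";
        if (lower.endsWith(".jar")) return "application/java-archive";
        return "application/octet-stream";
    }

    // Walks the zip and passes each entry to the handler registered for its type.
    // Entries whose type has no registered handler are skipped.
    public static List<String> parseArchive(
            InputStream in, Map<String, Function<byte[], String>> handlers)
            throws IOException {
        List<String> results = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) continue;
                Function<byte[], String> handler =
                    handlers.get(guessType(entry.getName()));
                if (handler == null) continue; // no handler for this type: skip
                // readAllBytes() stops at the end of the current zip entry
                results.add(entry.getName() + ": " + handler.apply(zip.readAllBytes()));
            }
        }
        return results;
    }

    // Builds a small in-memory zip with two .pdf entries and one .jar entry.
    public static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"report.pdf", "lib.jar", "notes.pdf"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("content of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Only one type is registered, so the .jar entry is never handled.
        Map<String, Function<byte[], String>> handlers =
            Map.of("application/pdf", data -> data.length + " bytes");
        for (String line : parseArchive(
                new ByteArrayInputStream(sampleArchive()), handlers)) {
            System.out.println(line);
        }
    }
}
```

Registering a single type plays the same role as the single-entry parser map above: anything that does not match is left alone rather than causing an error.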

> Or is there a general option to define which content types should be parsed when using
> the Tika.parse(...) facade?

You can modify the Tika configuration that you pass to the Tika
facade, but the same configuration applies both when you parse the
top-level archive and when you parse any documents inside it, so this
may not be exactly what you're looking for.

BR,

Jukka Zitting