Posted to user@tika.apache.org by jetnet <je...@gmail.com> on 2016/04/17 22:13:37 UTC

OCR black/white listing

Greetings to the Community!

A simple question: is there a way to whitelist or blacklist certain MIME types
or file types for OCR?
For example, I'd like to extract and OCR embedded images from PDFs only (which
is, fortunately, configurable for that parser). The Office parsers, by default,
always extract and OCR inline images, and this does not seem to be configurable
(unfortunately). How can I turn it off?
I played around with <parser-exclude>, <mime-exclude>, and <mime>, but had no luck.
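
For illustration, a minimal sketch of the kind of tika-config.xml meant here. It is not the exact file that was tried; it assumes the stock DefaultParser and TesseractOCRParser class names, and the MIME type in the comment is only an example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Drops the Tesseract OCR parser from the default parser set. -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <!-- A mime-exclude would go here in the same way, e.g.
           <mime-exclude>image/png</mime-exclude> (example type only). -->
    </parser>
  </parsers>
</properties>
```

Excluding the OCR parser this way appears to apply to every container format at once, PDFs included, so it does not give the per-format control asked about above.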
Any ideas? Thanks a lot!