
Posted to user@manifoldcf.apache.org by Konrad Holl <KH...@searchtechnologies.com> on 2016/03/09 15:45:18 UTC

[Tika content extraction Content Transformation Component] Additional Options

Hi,

For a client project I needed to enable OCR for images embedded in PDFs. Unfortunately, ManifoldCF does not provide configuration options for this. It would be nice to have the following options for the Tika content extraction:

1. Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29

2. Set the default language for Tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
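
To make the request more concrete, here is a rough sketch of how two connector parameters might be read and then handed to Tika. The class OcrOptions, the parameter names "extractInlineImages" and "ocrLanguage", and the fallback values are my own assumptions for illustration; none of them exist in ManifoldCF today. Only the Tika calls named in the comments are taken from the Javadoc linked above.

```java
import java.util.Map;

// Hypothetical holder for the two requested options.
// Not part of ManifoldCF or Tika; names are illustrative only.
final class OcrOptions {
    final boolean extractInlineImages;
    final String ocrLanguage;

    OcrOptions(boolean extractInlineImages, String ocrLanguage) {
        this.extractInlineImages = extractInlineImages;
        this.ocrLanguage = ocrLanguage;
    }

    // Reads the two settings from a parameter map. The keys and the
    // fallbacks ("false", "eng") are assumptions made for this sketch.
    static OcrOptions fromParameters(Map<String, String> params) {
        boolean extract =
            Boolean.parseBoolean(params.getOrDefault("extractInlineImages", "false"));
        String language = params.getOrDefault("ocrLanguage", "eng");
        return new OcrOptions(extract, language);
    }

    // With Tika on the classpath, the values would then be applied roughly as:
    //
    //   PDFParserConfig pdfConfig = new PDFParserConfig();
    //   pdfConfig.setExtractInlineImages(options.extractInlineImages);
    //
    //   TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
    //   ocrConfig.setLanguage(options.ocrLanguage);
    //
    //   ParseContext context = new ParseContext();
    //   context.set(PDFParserConfig.class, pdfConfig);
    //   context.set(TesseractOCRConfig.class, ocrConfig);
    //
    // and the ParseContext passed to the parser's parse(...) call.
}
```

For example, OcrOptions.fromParameters(Map.of("extractInlineImages", "true", "ocrLanguage", "deu")) would turn on inline image extraction and select German for Tesseract, while an empty map would leave the current behaviour unchanged.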

Thanks

-Konrad

KONRAD HOLL
Senior Technical Consultant

M +49 178 8855 553
F  +49 178 99 8855 553
Skype: konrad.holl

Search Technologies GmbH
Theodor-Heuss-Allee 112
60486 Frankfurt am Main

SEARCH TECHNOLOGIES
Find Better Answers.
www.searchtechnologies.com