Posted to user@manifoldcf.apache.org by Konrad Holl <KH...@searchtechnologies.com> on 2016/03/09 15:45:18 UTC
[Tika content extraction Content Transformation Component] Additional Options
Hi,
For a client project I needed to enable OCR for images embedded in PDFs. Unfortunately, ManifoldCF does not provide configuration options for this. It would be nice to have the following options in the Tika content extraction transformation connector:
1. Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
2. Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
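To illustrate what I mean, here is a minimal sketch in Java of how the two options could be read from the connector configuration. It is not existing ManifoldCF code: the class OcrOptions, the parameter names "extractInlineImages" and "ocrLanguage", and the fallback language "eng" are assumptions made for illustration only. The Tika calls named in the trailing comment are the two setters linked above, applied through a ParseContext.

```java
import java.util.Map;

// Hypothetical holder for the two requested options. The parameter names
// below are illustrative and are not existing ManifoldCF configuration keys.
final class OcrOptions {
    private final boolean extractInlineImages;
    private final String ocrLanguage;

    OcrOptions(boolean extractInlineImages, String ocrLanguage) {
        this.extractInlineImages = extractInlineImages;
        this.ocrLanguage = ocrLanguage;
    }

    // Reads both options from a flat parameter map. Missing or blank values
    // fall back to "no image extraction" and the assumed default "eng".
    static OcrOptions fromParameters(Map<String, String> params) {
        boolean extract = Boolean.parseBoolean(params.get("extractInlineImages"));
        String raw = params.get("ocrLanguage");
        String language = (raw == null || raw.trim().isEmpty()) ? "eng" : raw.trim();
        return new OcrOptions(extract, language);
    }

    boolean extractInlineImages() {
        return extractInlineImages;
    }

    String ocrLanguage() {
        return ocrLanguage;
    }

    // With Tika on the classpath, the values would be handed to the parser
    // roughly like this (left as a comment so the sketch compiles on its own):
    //   PDFParserConfig pdfConfig = new PDFParserConfig();
    //   pdfConfig.setExtractInlineImages(options.extractInlineImages());
    //   TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
    //   ocrConfig.setLanguage(options.ocrLanguage());
    //   ParseContext context = new ParseContext();
    //   context.set(PDFParserConfig.class, pdfConfig);
    //   context.set(TesseractOCRConfig.class, ocrConfig);
}
```

For example, a map containing extractInlineImages=true and ocrLanguage=deu would yield options that switch inline image extraction on and select German for tesseract. An empty map leaves the current behaviour unchanged.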
Thanks
-Konrad
KONRAD HOLL
Senior Technical Consultant
M +49 178 8855 553
F +49 178 99 8855 553
Skype: konrad.holl
Search Technologies GmbH
Theodor-Heuss-Allee 112
60486 Frankfurt am Main
SEARCH TECHNOLOGIES
Find Better Answers.
www.searchtechnologies.com