Posted to dev@manifoldcf.apache.org by "Karl Wright (JIRA)" <ji...@apache.org> on 2016/04/05 07:38:25 UTC

[jira] [Commented] (CONNECTORS-1287) Additional TikaOCR Configuration Options

    [ https://issues.apache.org/jira/browse/CONNECTORS-1287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225715#comment-15225715 ] 

Karl Wright commented on CONNECTORS-1287:
-----------------------------------------

I'm pushing this on to MCF 2.5.

I think that as long as Tesseract is properly installed on all machines in a cluster, it's OK to have a JNI dependency as a requirement.  Model files, however, need to be worked out.  Specifically, if there is any need to select a model for the OCR configuration, the model files should be handled in a manner similar to how the OpenNLP integration does it: there's a well-known and configured folder that these model files must be found in.  I don't know enough about Tesseract to know if this is going to be a problem or not though.
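
To make the "well-known and configured folder" idea concrete, here is a minimal sketch using only the Java standard library. It assumes the usual Tesseract convention that the model for a language code lives in a file named <language>.traineddata. The class name OcrModelFolder, the method resolveModel, and the default folder name "tessdata" are hypothetical and are not part of ManifoldCF or Tika.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not ManifoldCF or Tika code.
final class OcrModelFolder {

    // Looks up "<language>.traineddata" inside the configured folder.
    // Returns an empty Optional when the file is not there.
    static Optional<Path> resolveModel(Path modelFolder, String language) {
        // Assumption: a single simple language code such as "eng" or "chi_sim".
        // Rejecting everything else also keeps lookups inside the folder.
        if (language == null || !language.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("Invalid language code: " + language);
        }
        Path candidate = modelFolder.resolve(language + ".traineddata");
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed defaults for illustration only.
        Path folder = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String language = args.length > 1 ? args[1] : "eng";
        Optional<Path> model = resolveModel(folder, language);
        System.out.println(model
            .map(p -> "Model found: " + p)
            .orElse("No model for '" + language + "' in " + folder));
    }
}
```

A check like this could run when the connection is saved, so that a missing model file is reported at configuration time rather than in the middle of a crawl.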


> Additional TikaOCR Configuration Options
> ----------------------------------------
>
>                 Key: CONNECTORS-1287
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1287
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Tika extractor
>    Affects Versions: ManifoldCF 2.3
>            Reporter: Konrad Holl
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.5
>
>
> For a client project I needed to enable OCR for images inside PDFs. Unfortunately ManifoldCF does not provide configuration options to handle this. It would be nice to have these options for the Tika content extraction:
> 1.	Enable PDF image extraction for OCR: https://tika.apache.org/1.7/api/org/apache/tika/parser/pdf/PDFParserConfig.html#setExtractInlineImages%28boolean%29
> 2.	Set default language for tesseract: https://tika.apache.org/1.7/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setLanguage%28java.lang.String%29
> Tika OCR is based on tesseract, an Open Source OCR library initially developed by Hewlett-Packard and later continued by Google. It is available from https://github.com/tesseract-ocr/tesseract . It needs to be installed with the tesseract binary available in the PATH environment variable - alternatively it can be set using a Tika API method. Once it is installed and Tika is configured correctly, it works like a charm.
> When indexing images or PDFs containing images instead of real text, OCR is necessary for making those documents searchable.
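
The PATH requirement described above can be checked up front. The following is a minimal sketch using only the Java standard library. The class name TesseractLocator and the method findOnPath are hypothetical, and the candidate file names "tesseract" and "tesseract.exe" are assumptions about how the binary is named on a given platform.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical helper, not ManifoldCF or Tika code.
final class TesseractLocator {

    // Returns the first executable file with one of the given names
    // found in the directories listed in pathValue.
    static Optional<Path> findOnPath(String pathValue, String... names) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        for (String dir : pathValue.split(Pattern.quote(File.pathSeparator))) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate) && Files.isExecutable(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<Path> found =
            findOnPath(System.getenv("PATH"), "tesseract", "tesseract.exe");
        System.out.println(found
            .map(p -> "tesseract found at " + p)
            .orElse("tesseract not found on PATH"));
    }
}
```

If nothing is found, the location would have to be supplied explicitly through the Tika API, as the description notes.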



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)