Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:37:25 UTC

[jira] [Created] (NIFI-1718) Processor(s) to perform OCR

Dmitry Goldenberg created NIFI-1718:
---------------------------------------

Summary: Processor(s) to perform OCR
Key: NIFI-1718
URL: https://issues.apache.org/jira/browse/NIFI-1718
Project: Apache NiFi
Issue Type: New Feature
Components: Core Framework
Reporter: Dmitry Goldenberg

This ticket is a follow-up to NIFI-1717.

By default, Apache Tika uses Tesseract to perform OCR on image files such as PNG, BMP, JPEG, and GIF, provided that Tesseract is installed and properly configured.

Design issue: should the ExtractMediaAttributes processor allow Tika to perform OCR, or should OCR be handled elsewhere, by a separate processor or by a service? Could both models be allowed, where ExtractMediaAttributes supports OCR but there's also a separate PerformOCR processor and/or service?

If OCR is supported in the ExtractMediaAttributes processor, it'd be best if it supported the following OCR-related options (which are exposed by Tika's TesseractOCRConfig class):

* tesseractPath - Path to the Tesseract installation folder, if it is not on the system path.
* language - Language ID (e.g. "eng"); selects the language dictionary to be used.
* pageSegMode - Tesseract page segmentation mode; defaults to 1.
* minFileSizeToOcr - Minimum file size to submit a file to OCR; defaults to 0.
* maxFileSizeToOcr - Maximum file size to submit a file to OCR; defaults to Integer.MAX_VALUE.
* timeout - Maximum time (in seconds) to wait for the OCR process to terminate; defaults to 120.
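
To make the option list above more concrete, here is a minimal sketch of how a processor might parse these settings from string property values and apply the file size bounds. The OcrOptions class, its constructor, and shouldOcr are hypothetical names invented for illustration; they are not part of NiFi or Tika. The "eng" default for language and the inclusive size bounds are assumptions, and the sketch deliberately avoids calling Tika so that it stays self-contained.

```java
import java.util.Map;

// Hypothetical holder for the OCR options listed in this ticket.
// Not part of NiFi or Tika; names and behavior are illustrative only.
final class OcrOptions {
    final String tesseractPath;   // empty string: rely on the system path
    final String language;        // "eng" default is an assumption
    final int pageSegMode;        // defaults to 1, per the ticket
    final long minFileSizeToOcr;  // defaults to 0, per the ticket
    final long maxFileSizeToOcr;  // defaults to Integer.MAX_VALUE, per the ticket
    final int timeoutSeconds;     // defaults to 120, per the ticket

    // Builds the options from raw string values, as a processor might
    // receive them from its configured properties.
    OcrOptions(Map<String, String> props) {
        tesseractPath = props.getOrDefault("tesseractPath", "");
        language = props.getOrDefault("language", "eng");
        pageSegMode = Integer.parseInt(props.getOrDefault("pageSegMode", "1"));
        minFileSizeToOcr = Long.parseLong(props.getOrDefault("minFileSizeToOcr", "0"));
        maxFileSizeToOcr = Long.parseLong(
                props.getOrDefault("maxFileSizeToOcr", String.valueOf(Integer.MAX_VALUE)));
        timeoutSeconds = Integer.parseInt(props.getOrDefault("timeout", "120"));

        if (minFileSizeToOcr > maxFileSizeToOcr) {
            throw new IllegalArgumentException("minFileSizeToOcr exceeds maxFileSizeToOcr");
        }
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
    }

    // True if a file of the given size falls within the configured bounds.
    // Treating both bounds as inclusive is an assumption of this sketch.
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }
}

class OcrOptionsDemo {
    public static void main(String[] args) {
        OcrOptions options = new OcrOptions(Map.of("minFileSizeToOcr", "1024"));
        System.out.println("OCR a 100-byte file?  " + options.shouldOcr(100));
        System.out.println("OCR a 2048-byte file? " + options.shouldOcr(2048));
    }
}
```

Validating the values once, at construction time, would let a processor reject a bad configuration before any file is submitted to OCR. In a real implementation these values would presumably be passed on to Tika's TesseractOCRConfig rather than checked by hand.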

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)