Posted to commits@nifi.apache.org by "Jeremy Dyer (JIRA)" <ji...@apache.org> on 2016/04/20 01:14:25 UTC
[jira] [Commented] (NIFI-1718) Processor(s) to perform OCR

    [ https://issues.apache.org/jira/browse/NIFI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248878#comment-15248878 ] 

Jeremy Dyer commented on NIFI-1718:
-----------------------------------

[~dgoldenberg] I came to create a jira for a NiFi Tesseract processor today and stumbled across this jira. Seems I'm a few days late. I have already created a pure Tesseract processor that accounts for all of the bullet points you listed (plus the ability to pass in raw configuration key/values), but it doesn't use Tika as you have described here. I would be glad to contribute what I have, but I wanted to run it by you first since you specifically called out Tika and I'm not using that. Would it be a big deal if my implementation didn't use Tika explicitly, or do you need Tika for something else?

Just for reference, here is a quick screen recording of what I have so far: https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer

> Processor(s) to perform OCR
> ---------------------------
>
>                 Key: NIFI-1718
>                 URL: https://issues.apache.org/jira/browse/NIFI-1718
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> This ticket is a follow-up to NIFI-1717.
> Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG, GIF, etc. using Tesseract, assuming that it is installed and properly configured.
> Design issue: should ExtractMediaAttributes processor allow Tika to perform OCR or should OCR be handled elsewhere, whether by a processor or by a service?  Could both models be allowed, where ExtractMediaAttributes supports OCR but there's also a separate PerformOCR processor and/or service?
> If OCR is supported on the ExtractMediaAttributes processor, it'd be best if it supported the following OCR related options (which are exposed by Tika's TesseractOCRConfig class):
> * tesseractPath - Path to tesseract installation folder, if not on system path.
> * language - Language ID (e.g. "eng"); language dictionary to be used.
> * pageSegMode - Tesseract page segmentation mode, defaults to 1.
> * minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> * maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to Integer.MAX_VALUE.
> * timeout - Maximum time (in seconds) to wait for the OCR process termination; defaults to 120.
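
For illustration only, the options listed above could be carried by a small holder like the one below. This is a rough sketch, not Tika's TesseractOCRConfig and not the processor mentioned in the comment: the class name OcrOptions, its field names, and the methods shouldOcr and timeoutMillis are hypothetical. The defaults for pageSegMode, minFileSizeToOcr, maxFileSizeToOcr and timeout are the ones stated in the ticket; the "eng" default, the empty path meaning "use the system path", and the inclusive size bounds are assumptions.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical holder for the OCR options listed in the ticket.
 * It only mirrors the documented option names and defaults; it does
 * not call Tika or Tesseract.
 */
final class OcrOptions {
    // Empty string is assumed to mean "tesseract is on the system path".
    String tesseractPath = "";
    // The ticket gives "eng" only as an example; using it as a default is an assumption.
    String language = "eng";
    // Defaults below are the ones stated in the ticket.
    String pageSegMode = "1";
    long minFileSizeToOcr = 0L;
    long maxFileSizeToOcr = Integer.MAX_VALUE;
    int timeoutSeconds = 120;

    /** True when the file size falls within the configured bounds (assumed inclusive). */
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }

    /** The timeout converted to milliseconds, e.g. for waiting on an external process. */
    long timeoutMillis() {
        return TimeUnit.SECONDS.toMillis(timeoutSeconds);
    }
}
```

A processor could call shouldOcr with the flow file size before handing the content to OCR, and skip files outside the bounds instead of failing them.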



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)