Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:45:25 UTC

[jira] [Created] (NIFI-1719) Processor to handle text extraction from PDF (whether text or scanned PDF)

Dmitry Goldenberg created NIFI-1719:
---------------------------------------

Summary: Processor to handle text extraction from PDF (whether text or scanned PDF)
Key: NIFI-1719
URL: https://issues.apache.org/jira/browse/NIFI-1719
Project: Apache NiFi
Issue Type: New Feature
Components: Core Framework
Reporter: Dmitry Goldenberg

For a 'text' PDF, Apache Tika can extract the text directly. However, for a scanned PDF, or a PDF with both textual and scanned content, the text in the scanned pages can only be extracted by applying OCR.

Apache Tika has integrated support for OCR via Tesseract, assuming that Tesseract is installed and properly configured.

However, Tesseract doesn't handle scanned PDFs. The proposal here is to implement a processor which would break up a PDF into pages (e.g. using PDFBox) and send each page to Apache Tika for OCR.

Each OCR'ed page would yield a piece of text; these pieces would be concatenated, in page order, and placed into the "text" attribute on the FlowFile.
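
The per-page flow described above could be sketched roughly as follows. This is only an illustration and not a proposed implementation: the class name PageOcrAggregator, the generic page type, and the newline separator are all assumptions, and the ocr function is a stand-in for whatever call would hand a single page to Tika/Tesseract. No PDFBox or Tika API is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: OCRs pages one at a time and joins the results in page order. */
final class PageOcrAggregator {

    private PageOcrAggregator() {}

    /**
     * @param pages one entry per PDF page, already split out (e.g. by PDFBox)
     * @param ocr   stand-in for the call that sends one page to Tika for OCR
     * @return the per-page text, trimmed and joined with newlines in page order
     */
    static <P> String extractText(List<P> pages, Function<P, String> ocr) {
        List<String> pieces = new ArrayList<>(pages.size());
        for (P page : pages) {
            String text = ocr.apply(page);
            // Treat a page with no OCR result as empty so later pages keep their position.
            pieces.add(text == null ? "" : text.trim());
        }
        return String.join("\n", pieces);
    }
}
```

The returned string is what would end up in the "text" attribute; keeping an empty entry for pages that yield nothing preserves the page order of the remaining pieces.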

For a text-based PDF, from which Tika has already extracted the text, no such OCR-based processing would be needed.
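
One possible way to decide whether the OCR pass can be skipped is sketched below. The class OcrDecision and the minChars threshold are assumptions made for illustration; the issue itself does not specify how "Tika already has extracted text" would be detected.

```java
/** Hypothetical check: decides whether the OCR pass is still needed after plain Tika extraction. */
final class OcrDecision {

    private OcrDecision() {}

    /**
     * @param tikaText text returned by the regular (non-OCR) Tika extraction, may be null
     * @param minChars assumed threshold: fewer non-whitespace-trimmed characters than this
     *                 is treated as "no usable text", so OCR would be attempted
     */
    static boolean needsOcr(String tikaText, int minChars) {
        if (tikaText == null) {
            return true;
        }
        return tikaText.trim().length() < minChars;
    }
}
```

A simple length check like this would not catch a PDF that mixes textual and scanned pages; handling that case would presumably need a per-page decision.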

The processor will need some OCR configuration parameters exposed, similar to NIFI-1718. Additionally, it can have a parameter for the maximum number of pages to process with OCR.
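
The page limit could be applied along these lines. Again this is a sketch under assumptions: OcrPageLimit is a hypothetical name, and treating a non-positive value as "no limit" is a convention chosen here, not something stated in the issue.

```java
import java.util.List;

/** Hypothetical helper: caps the number of pages that are sent to OCR. */
final class OcrPageLimit {

    private OcrPageLimit() {}

    /**
     * @param pages    all pages of the PDF, in order
     * @param maxPages assumed convention: a value of 0 or less means "no limit"
     * @return a view of the first maxPages pages, or all pages if there is no limit
     */
    static <P> List<P> limit(List<P> pages, int maxPages) {
        if (maxPages <= 0) {
            return pages;
        }
        return pages.subList(0, Math.min(maxPages, pages.size()));
    }
}
```

In an actual processor the limit would be read from a configured property rather than passed in directly.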

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)