Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 07:28:26 UTC

[jira] [Created] (NIFI-1717) Processor to extract metadata attributes and content from incoming files

Dmitry Goldenberg created NIFI-1717:
---------------------------------------

Summary: Processor to extract metadata attributes and content from incoming files
Key: NIFI-1717
URL: https://issues.apache.org/jira/browse/NIFI-1717
Project: Apache NiFi
Issue Type: New Feature
Components: Core Framework
Reporter: Dmitry Goldenberg

This would be a continuation of the work that Joe Skora did on the ExtractMediaAttributes processor.

The design discussions so far have centered around the following.

1. The processor will continue to use Apache Tika to extract metadata from incoming files, content from the incoming files, or both, as configured.
2. The extracted metadata shall be added as values of attributes on the FlowFile.
3. The extracted text shall be added as a value of the field "text".
4. Configuration options are needed so the user can tell the processor what to extract and in which cases. Building on the filename and MIME type filters provided by Joe:

* INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
* INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
* INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
* INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
* EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
* EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
* EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
* EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type

Per Joe's point, an exclusion shall trump an inclusion rule.
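
To make the precedence rule concrete, here is a minimal sketch of how one include/exclude filter set might be evaluated. It is not the processor's actual code. The class name ExtractionFilter and the method shouldExtract are hypothetical, and the sketch assumes the filter values are Java regular expressions, that an unset include filter means "no restriction", and that the filename and MIME type include filters must both pass. None of those points has been settled in the discussion above.

```java
import java.util.regex.Pattern;

/**
 * Hypothetical helper illustrating "exclusion trumps inclusion".
 * Not part of NiFi; names and semantics are assumptions for illustration.
 */
final class ExtractionFilter {
    // A null pattern means the corresponding filter property was left unset.
    private final Pattern includeFilename;
    private final Pattern excludeFilename;
    private final Pattern includeMimeType;
    private final Pattern excludeMimeType;

    ExtractionFilter(String includeFilename, String excludeFilename,
                     String includeMimeType, String excludeMimeType) {
        this.includeFilename = compile(includeFilename);
        this.excludeFilename = compile(excludeFilename);
        this.includeMimeType = compile(includeMimeType);
        this.excludeMimeType = compile(excludeMimeType);
    }

    // Treat null or empty property values as "filter not configured".
    private static Pattern compile(String regex) {
        return (regex == null || regex.isEmpty()) ? null : Pattern.compile(regex);
    }

    // True only when the filter is configured and the whole value matches it.
    private static boolean matches(Pattern pattern, String value) {
        return pattern != null && value != null && pattern.matcher(value).matches();
    }

    /** Decides whether extraction should run for the given file. */
    boolean shouldExtract(String filename, String mimeType) {
        // Any matching exclusion wins, regardless of the include filters.
        if (matches(excludeFilename, filename) || matches(excludeMimeType, mimeType)) {
            return false;
        }
        // Assumption: an unset include filter places no restriction.
        boolean filenameOk = includeFilename == null || matches(includeFilename, filename);
        boolean mimeTypeOk = includeMimeType == null || matches(includeMimeType, mimeType);
        return filenameOk && mimeTypeOk;
    }
}
```

Under this sketch, the processor would hold two such instances, one built from the four *_CONTENT_* properties and one from the four *_METADATA_* properties, and consult each before asking Tika for text or metadata respectively.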

Apache Tika has integrated support for OCR via Tesseract. If Tesseract is installed and properly configured, Tika performs OCR on image files such as PNG, BMP, JPEG, and GIF.

A separate ticket will address how OCR should be handled, since it is an expensive operation that may require special configuration and handling.
