Posted to commits@nifi.apache.org by "Dmitry Goldenberg (JIRA)" <ji...@apache.org> on 2016/04/01 17:20:25 UTC

[jira] [Commented] (NIFI-1717) Processor to extract metadata attributes and content from incoming files

    [ https://issues.apache.org/jira/browse/NIFI-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15221837#comment-15221837 ] 

Dmitry Goldenberg commented on NIFI-1717:
-----------------------------------------

Mark Payne's comments from the discussion thread:
{quote}
I would be a bit concerned about providing options for filters that include and
exclude certain things. I believe that if you send a FlowFile to the Processor,
then the Processor should do its thing. If you want to filter out which FlowFiles
have their content extracted, for example, I would suggest using a Processor
like RouteOnAttribute to ensure that only the appropriate FlowFiles are processed
by the ExtractMediaMetadata processor.

This allows the metadata extraction processor to focus purely on extracting
metadata without having to deal with all of the logic of filtering things out. The logic
for filtering things out is almost guaranteed to grow much more complex as people
start to use this more and more. NiFi already provides several route-based processors
to allow for a great deal of flexibility with this type of logic (RouteOnAttribute, RouteOnContent,
ScanAttribute, ScanContent, etc.).
{quote}

> Processor to extract metadata attributes and content from incoming files
> ------------------------------------------------------------------------
>
>                 Key: NIFI-1717
>                 URL: https://issues.apache.org/jira/browse/NIFI-1717
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> This would continue the work that Joe Skora did on the ExtractMediaAttributes processor.
> The design discussions so far have centered on the following:
> 1. The processor will continue to use Apache Tika to extract metadata from incoming files, content from the incoming files, or both, as configured.
> 2. The extracted metadata shall be added as values of attributes on the FlowFile.
> 3. The extracted text shall be added as a value of the field "text".
> 4. There need to be configuration options to let the user tell the processor what needs to be extracted and for which cases. Building on the filename and MIME type filters provided by Joe:
> * INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files get their content extracted, by file name
> * INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files get their metadata extracted, by file name
> * INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files get their content extracted, by MIME type
> * INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files get their metadata extracted, by MIME type
> * EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files do NOT get their content extracted, by file name
> * EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by file name
> * EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their content extracted, by MIME type
> * EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files do NOT get their metadata extracted, by MIME type
> Per Joe's point, an exclusion shall trump an inclusion rule.
> Apache Tika has integrated support for OCR via Tesseract. If Tesseract is installed and properly configured, Tika performs OCR on image files such as PNG, BMP, JPEG, GIF, etc.
> A separate ticket, NIFI-1718, is meant to address how OCR should be handled, as it is an expensive operation that may require special configuration and handling.
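
For illustration only, the include/exclude filters listed in the description, with an exclusion trumping an inclusion, could be sketched roughly as below. This is not the processor's code (a NiFi processor would be written in Java); it is a minimal Python sketch of the precedence rule. The names FilterPair and should_extract are hypothetical, and three behaviors are assumptions not stated in the ticket: patterns are treated as regular expressions, an unset include filter means "include everything", and the file name and MIME type filters must both allow a file.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FilterPair:
    """One include/exclude pair (hypothetical helper), e.g. the content
    filters keyed on file name."""
    include: Optional[str] = None  # regex; assumption: None includes everything
    exclude: Optional[str] = None  # regex; None excludes nothing

    def allows(self, value: str) -> bool:
        # Check the exclusion first so that it trumps any inclusion rule.
        if self.exclude is not None and re.search(self.exclude, value):
            return False
        if self.include is None:
            return True
        return re.search(self.include, value) is not None


def should_extract(filename: str, mime_type: str,
                   by_filename: FilterPair, by_mime_type: FilterPair) -> bool:
    # Assumption: both the file name and the MIME type filters must allow it.
    return by_filename.allows(filename) and by_mime_type.allows(mime_type)


# Example configuration for content extraction (values are made up).
content_by_name = FilterPair(include=r"\.(pdf|docx?)$", exclude=r"^draft_")
content_by_mime = FilterPair(exclude=r"^image/")
```

With this configuration, "report.pdf" of type "application/pdf" would have its content extracted, while "draft_report.pdf" would not, because the exclusion wins even though the inclusion pattern also matches. A second pair of FilterPair objects would cover the metadata filters in the same way.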



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)