Posted to commits@nifi.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/02/17 19:55:11 UTC
[jira] [Commented] (NIFI-296) Extend the capability of IdentifyMimeType and extract document metadata

    [ https://issues.apache.org/jira/browse/NIFI-296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324665#comment-14324665 ] 

ASF GitHub Bot commented on NIFI-296:
-------------------------------------

GitHub user adamonduty opened a pull request:

    https://github.com/apache/incubator-nifi/pull/27

    NIFI-296: Extend capability of IdentifyMimeType

    ```
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
    
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).
    ```
    
    Some additional notes about this commit:
    
    I removed the IDENTIFY_ZIP and IDENTIFY_TAR properties. Keeping IDENTIFY_ZIP doesn't make sense because Tika is designed to identify container formats like zip files. Excluding zip files from detection would exclude a number of common mime types, which seems like undesirable behavior. IDENTIFY_TAR is in a similar situation.
    
    Also, in both cases, the previous code would "identify" a zip or tar file by attempting to open it with a Zip or Tar reader. I believe Tika uses magic byte detection as a filtering mechanism, so that deep inspection logic (i.e., opening the zip with a reader) is applied only when necessary.
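
The filtering idea described above can be sketched with the JDK alone. This is not Tika's implementation, only a minimal illustration under stated assumptions: the class name `ContainerSniffer` and the labels `not-zip`, `zip`, and `ooxml` are hypothetical, and the only OOXML marker checked is the `[Content_Types].xml` entry that OOXML packages carry.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of NiFi or Tika.
final class ContainerSniffer {
    // Local file header signature that starts a non-empty zip archive: "PK\003\004".
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Cheap check: compares only the first four bytes. */
    static boolean hasZipMagic(byte[] data) {
        if (data.length < ZIP_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < ZIP_MAGIC.length; i++) {
            if (data[i] != ZIP_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Expensive check: walks the zip entries looking for the OOXML marker entry. */
    static boolean hasOoxmlContentTypes(byte[] data) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("[Content_Types].xml".equals(e.getName())) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns an illustrative label; a zip reader is opened only if the magic bytes match. */
    static String classify(byte[] data) throws IOException {
        if (!hasZipMagic(data)) {
            return "not-zip"; // filtered out by the cheap check
        }
        return hasOoxmlContentTypes(data) ? "ooxml" : "zip";
    }
}
```

Calling `ContainerSniffer.classify(bytes)` on arbitrary content never opens a zip reader unless the first four bytes match, which is the cost saving the paragraph above refers to.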
    
    It takes about 2 seconds to bring up the Tika detectors, which makes the tests run longer, but I believe the detection itself is roughly in the same performance category. The code shares a Tika config and list of detectors to minimize the performance impact related to bringing up detectors.
    
    I also replaced the test resource `1.tar` with a version created by a modern version of tar. The previous tar didn't use the ustar format (http://en.wikipedia.org/wiki/Tar_%28computing%29#UStar_format), which was standardized in 1988. Tika also couldn't identify the previous tar using magic byte detection.
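
For context on why the older archive is hard to detect by magic bytes: a ustar header stores the ASCII string `ustar` at byte offset 257 of its first 512-byte block, while pre-POSIX tar headers leave that field empty. The sketch below shows such a check using only the JDK; the class name `UstarSniffer` is hypothetical and this is not a copy of Tika's detection logic.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NiFi or Tika.
final class UstarSniffer {
    // Offset of the magic field inside the 512-byte tar header block.
    private static final int MAGIC_OFFSET = 257;
    // Both the POSIX ("ustar\0") and GNU ("ustar ") variants begin with these five bytes.
    private static final byte[] MAGIC = "ustar".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the buffer carries the ustar magic at offset 257. */
    static boolean hasUstarMagic(byte[] header) {
        if (header.length < MAGIC_OFFSET + MAGIC.length) {
            return false; // too short to contain the magic field
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[MAGIC_OFFSET + i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An archive written in the older format would fail this check even though a tar reader could still open it, which is consistent with the behavior described for the previous `1.tar`.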
    
    And finally, a few of the detected mime types changed names due to Tika naming them differently.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adamonduty/incubator-nifi NIFI-296-extend-IdentifyMimeType

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-nifi/pull/27.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #27
    
----
commit 16fb2b826c0cd983b5d905ceed7aff2a84383d33
Author: Adam Lamar <ad...@gmail.com>
Date:   2015-02-14T20:57:41Z

    NIFI-296: Extend capability of IdentifyMimeType
    
    This commit backs IdentifyMimeType with the Apache Tika library. Tika
    provides detailed mime type identification such as the ability to
    differentiate normal zip files from OOXML MS Office documents.
    
    The mime.type attribute continues to be set, though some mime types
    have changed due to Tika naming them differently. In addition,
    the mime.extension attribute is set to provide the commonly used
    extension for the mime type (if known).

----


> Extend the capability of IdentifyMimeType and extract document metadata
> -----------------------------------------------------------------------
>
>                 Key: NIFI-296
>                 URL: https://issues.apache.org/jira/browse/NIFI-296
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Joseph Witt
>            Priority: Minor
>
> Apache Tika is pretty awesome and can handle a large range of document types.  It could perhaps be used to extend the capability of IdentifyMimeType and it could also potentially be used to automatically extract document metadata/data as flow file attributes to be used for data flow routing decisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)