
Posted to dev@tika.apache.org by "Keith R. Bennett (JIRA)" <ji...@apache.org> on 2007/10/18 04:56:50 UTC

[jira] Created: (TIKA-79) Mime type detection from file header appears to be failing.

Mime type detection from file header appears to be failing.
-----------------------------------------------------------

                 Key: TIKA-79
                 URL: https://issues.apache.org/jira/browse/TIKA-79
             Project: Tika
          Issue Type: Bug
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
             Fix For: 0.1-incubator


Unit tests that exercise AutoDetectParser fail when detection from the byte header is needed.  When the correct resource names and MIME types are passed in through the Metadata object, the values below show what each detection method found.  Note that typeFromHeader is null for several of the document types:


typeFromContentTypeHint = application/vnd.ms-excel
typeFromResourceName = application/vnd.ms-excel
typeFromHeader = null
type = application/vnd.ms-excel


typeFromContentTypeHint = text/html
typeFromResourceName = text/html
typeFromHeader = text/html
type = text/html


typeFromContentTypeHint = application/vnd.oasis.opendocument.text
typeFromResourceName = application/vnd.oasis.opendocument.text
typeFromHeader = application/vnd.oasis.opendocument.text
type = application/vnd.oasis.opendocument.text


typeFromContentTypeHint = application/pdf
typeFromResourceName = application/pdf
typeFromHeader = application/pdf
type = application/pdf


typeFromContentTypeHint = application/vnd.ms-powerpoint
typeFromResourceName = application/vnd.ms-powerpoint
typeFromHeader = null
type = application/vnd.ms-powerpoint

log4j:WARN No appenders could be found for logger (root).
log4j:WARN Please initialize the log4j system properly.

typeFromContentTypeHint = application/rtf
typeFromResourceName = application/rtf
typeFromHeader = null
type = application/rtf


typeFromContentTypeHint = text/plain
typeFromResourceName = text/plain
typeFromHeader = null
type = text/plain


typeFromContentTypeHint = application/msword
typeFromResourceName = application/msword
typeFromHeader = null
type = application/msword


typeFromContentTypeHint = application/xml
typeFromResourceName = application/xml
typeFromHeader = null
type = application/xml



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.

Chris -

If I understand correctly, we already have what we need in MimeUtils:
    public String getType(String typeName, String url, byte[] data) { ... }

Jukka, should I modify AutoDetectParser to call this method instead of its
own?

However, the bigger question is whether the assessment is correct: does
header-based detection really fail for certain file types?  For example, it
fails to identify the type of the PowerPoint test document we provide.  Do we
know which types can and can't be detected?  If so, it would be helpful to our
users and ourselves to document that information.  I could put something
together based on my observations, but that would risk being incomplete or
incorrect due to different document software versions (e.g. Word).
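
As background on what leading bytes alone can distinguish: PDF and RTF files
begin with a distinctive ASCII marker, XML files usually begin with an XML
declaration, and legacy Word, Excel and PowerPoint files all share one 8-byte
OLE2 compound-document signature, so the header can identify the container but
not which of the three applications wrote it.  Plain text has no signature at
all.  The sketch below illustrates this under those assumptions.  MagicSniffer
and the placeholder type name returned for OLE2 are hypothetical and are not
Tika or MimeUtils code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Tika): guesses a MIME type from leading bytes. */
final class MagicSniffer {

    // OLE2 compound-document signature shared by legacy Word, Excel and PowerPoint files.
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Placeholder name for "some OLE2 container"; the header alone cannot say which one.
    static final String OLE2_PLACEHOLDER = "application/x-ole-storage";

    private static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    /** Returns the guessed MIME type, or null when no known signature matches. */
    static String detect(byte[] header) {
        if (startsWith(header, ascii("%PDF-"))) {
            return "application/pdf";
        }
        if (startsWith(header, ascii("{\\rtf"))) {
            return "application/rtf";
        }
        // Misses XML that starts with a byte order mark, whitespace, or no declaration.
        if (startsWith(header, ascii("<?xml"))) {
            return "application/xml";
        }
        if (startsWith(header, OLE2)) {
            return OLE2_PLACEHOLDER;
        }
        // Plain text and anything unrecognized: no signature to match.
        return null;
    }
}
```

A null result here is expected for text/plain, so a null typeFromHeader is not
by itself evidence of a bug for that type.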

- Keith

JIRA jira@apache.org wrote:
> 
> 
>     [
> https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917
> ] 
> 
> Chris A. Mattmann commented on TIKA-79:
> ---------------------------------------
> 
> Guys:
> 
> Why don't we put a utility method in MimeUtils to handle this
> functionality? The purpose of the utility method is to try to determine a
> mime type using all available options (URL resolution, extension ID, mime
> magic, etc.).
> 
> There is currently code in Nutch at:
> 
> http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup
> 
> See the private String getContentType(String typeName, String url, byte[]
> data) method at the bottom of the class to see how Nutch does this sort of
> failsafe mime resolution. Perhaps we should follow suit in Tika?
> 
> Cheers,
>  Chris
> 


[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Bertrand Delacretaz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535919 ] 

Bertrand Delacretaz commented on TIKA-79:
-----------------------------------------

+1 for a utility method as proposed by Chris, that tries several detection methods.


[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917 ] 

Chris A. Mattmann commented on TIKA-79:
---------------------------------------

Guys:

Why don't we put a utility method in MimeUtils to handle this functionality? The purpose of the utility method is to try to determine a mime type using all available options (URL resolution, extension ID, mime magic, etc.).

There is currently code in Nutch at:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup

See the private String getContentType(String typeName, String url, byte[] data) method at the bottom of the class to see how Nutch does this sort of failsafe mime resolution. Perhaps we should follow suit in Tika?
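
A minimal sketch of such a fallback method, assuming a precedence of declared type first, then URL extension, then magic bytes, then a generic default. This is not the Nutch or Tika implementation: MimeResolver, its extension table, and the pluggable magic function are all hypothetical, and the ordering is an assumption made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical fallback resolver; not the Nutch or Tika code. */
final class MimeResolver {

    static final String DEFAULT_TYPE = "application/octet-stream";

    // Small illustrative table; a real registry would be far larger.
    private static final Map<String, String> BY_EXTENSION = Map.of(
        "pdf", "application/pdf",
        "html", "text/html",
        "xml", "application/xml",
        "rtf", "application/rtf",
        "txt", "text/plain",
        "doc", "application/msword",
        "xls", "application/vnd.ms-excel",
        "ppt", "application/vnd.ms-powerpoint");

    private final Function<byte[], String> magic;

    /** The magic detector returns a MIME type or null; it is supplied by the caller. */
    MimeResolver(Function<byte[], String> magic) {
        this.magic = magic;
    }

    /** Lower-cased extension of the URL path, or null if there is none. */
    static String extensionOf(String url) {
        if (url == null) {
            return null;
        }
        int end = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            end = Math.min(end, query);
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0) {
            end = Math.min(end, fragment);
        }
        String path = url.substring(0, end);
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot <= slash || dot == path.length() - 1) {
            return null;
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Tries each source in turn and returns the first usable answer. */
    String getType(String typeName, String url, byte[] data) {
        if (typeName != null) {
            // Drop parameters such as "; charset=UTF-8".
            String cleaned = typeName.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (!cleaned.isEmpty()) {
                return cleaned;
            }
        }
        String extension = extensionOf(url);
        String byName = extension == null ? null : BY_EXTENSION.get(extension);
        if (byName != null) {
            return byName;
        }
        String byMagic = data == null ? null : magic.apply(data);
        return byMagic != null ? byMagic : DEFAULT_TYPE;
    }
}
```

Because each step runs only when the earlier ones yield nothing, a missing magic pattern stays hidden whenever a hint or file name is available, which is why the failure described in this issue only shows up when header detection is the sole source.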

Cheers,
 Chris



[jira] Updated: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Keith R. Bennett (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith R. Bennett updated TIKA-79:
---------------------------------

    Attachment: AutoDetectParser.patch

The attached patch file reorganizes the MIME type determination in AutoDetectParser so that it is easier to print out the types found by the various methods, and the logic for choosing the predominant result is confined to a smaller area (assuming I understood the intent correctly, that is).  In other words, I found it easier to debug.  If you like, I can commit it, minus the print statements.
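
The patch itself is not reproduced in this thread. As a rough sketch of the idea only, the class below keeps the three candidate types side by side and puts the choice in a single method. DetectionReport is a hypothetical name, and the precedence shown (hint, then resource name, then header) is an assumption; it is consistent with the output in the issue description, where type always equals the hint, but that output alone does not confirm it.

```java
/** Hypothetical holder for the candidate types; not the code from AutoDetectParser.patch. */
final class DetectionReport {

    static final String DEFAULT_TYPE = "application/octet-stream";

    final String typeFromContentTypeHint;
    final String typeFromResourceName;
    final String typeFromHeader;

    DetectionReport(String typeFromContentTypeHint,
                    String typeFromResourceName,
                    String typeFromHeader) {
        this.typeFromContentTypeHint = typeFromContentTypeHint;
        this.typeFromResourceName = typeFromResourceName;
        this.typeFromHeader = typeFromHeader;
    }

    /** The only place where candidates are ranked (assumed order, see lead-in). */
    String type() {
        if (typeFromContentTypeHint != null) {
            return typeFromContentTypeHint;
        }
        if (typeFromResourceName != null) {
            return typeFromResourceName;
        }
        if (typeFromHeader != null) {
            return typeFromHeader;
        }
        return DEFAULT_TYPE;
    }

    /** Formats the candidates in the same layout as the output quoted in the issue. */
    @Override
    public String toString() {
        return "typeFromContentTypeHint = " + typeFromContentTypeHint + "\n"
            + "typeFromResourceName = " + typeFromResourceName + "\n"
            + "typeFromHeader = " + typeFromHeader + "\n"
            + "type = " + type();
    }
}
```

Keeping the candidates as plain fields means a debug print is one toString() call that can be dropped before committing without touching the selection logic.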

I also found it helpful to comment out the LOG.info() call in MimeTypes.load().  (Is there a better way to disable it, such as setting that logger to some kind of null appender?)



[jira] Assigned: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned TIKA-79:
-------------------------------------

    Assignee: Chris A. Mattmann

> Mime type detection from file header appears to be failing.
> -----------------------------------------------------------
>
>                 Key: TIKA-79
>                 URL: https://issues.apache.org/jira/browse/TIKA-79
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.1-incubating
>            Reporter: Keith R. Bennett
>            Assignee: Chris A. Mattmann
>             Fix For: 0.2-incubating
>
>         Attachments: AutoDetectParser.patch

[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535934 ] 

Jukka Zitting commented on TIKA-79:
-----------------------------------

+1, it makes sense to push the functionality back to the MIME framework and also to target the test cases directly there instead of testing with AutoDetectParser.
