
Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2007/10/18 15:42:50 UTC

[jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

    [ https://issues.apache.org/jira/browse/TIKA-79?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535917 ] 

Chris A. Mattmann commented on TIKA-79:
---------------------------------------

Guys:

Why don't we put a utility method in MimeUtils to handle this functionality? The purpose of the utility method would be to try to detect a MIME type using all available options (URL resolution, extension ID, MIME magic, etc.).

There is currently code in Nutch at:

http://svn.apache.org/viewvc/lucene/nutch/trunk/src/java/org/apache/nutch/protocol/Content.java?view=markup

See the private String getContentType(String typeName, String url, byte[] data) method at the bottom of the class to see how Nutch does this sort of failsafe MIME resolution. Perhaps we should follow suit in Tika?
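
For illustration, a minimal sketch of what such a fallback chain could look like. This is not the Nutch code or the Tika MimeUtils code: the class name, the tiny lookup tables, the ordering (declared type, then URL extension, then magic bytes) and the application/octet-stream default are all assumptions made for the example. Only the parameter list mirrors the signature mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of fallback MIME resolution; not the Nutch or Tika implementation. */
final class FallbackMimeSketch {

    // Illustrative tables only (assumption: a real registry would be far larger).
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<String, String>();
    private static final Map<String, String> BY_MAGIC = new LinkedHashMap<String, String>();

    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("html", "text/html");
        BY_EXTENSION.put("rtf", "application/rtf");
        BY_MAGIC.put("%PDF-", "application/pdf");
        BY_MAGIC.put("{\\rtf", "application/rtf");
    }

    /** Looks up the extension of the URL path, ignoring any query string or fragment. */
    static String fromExtension(String url) {
        if (url == null) {
            return null;
        }
        int cut = url.length();
        int query = url.indexOf('?');
        if (query >= 0) {
            cut = Math.min(cut, query);
        }
        int fragment = url.indexOf('#');
        if (fragment >= 0) {
            cut = Math.min(cut, fragment);
        }
        String path = url.substring(0, cut);
        int dot = path.lastIndexOf('.');
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return null; // no extension in the last path segment
        }
        return BY_EXTENSION.get(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    /** Compares the first bytes of the data against the known magic prefixes. */
    static String fromMagic(byte[] data) {
        if (data == null) {
            return null;
        }
        String head = new String(data, 0, Math.min(data.length, 16), StandardCharsets.ISO_8859_1);
        for (Map.Entry<String, String> entry : BY_MAGIC.entrySet()) {
            if (head.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    /** Tries the declared type first, then the URL extension, then the magic bytes. */
    static String resolve(String typeName, String url, byte[] data) {
        String hint = typeName == null ? "" : typeName;
        int semicolon = hint.indexOf(';');
        if (semicolon >= 0) {
            hint = hint.substring(0, semicolon); // drop parameters such as charset
        }
        hint = hint.trim().toLowerCase(Locale.ROOT);
        if (!hint.isEmpty()) {
            return hint;
        }
        String type = fromExtension(url);
        if (type == null) {
            type = fromMagic(data);
        }
        // Assumed default when every strategy fails.
        return type != null ? type : "application/octet-stream";
    }
}
```

With this ordering, a call such as FallbackMimeSketch.resolve(null, url, data) only inspects the bytes when neither the declared type nor the extension gives an answer. A different ordering (for example, trusting magic bytes over the extension) would be an equally reasonable design choice.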

Cheers,
 Chris


> Mime type detection from file header appears to be failing.
> -----------------------------------------------------------
>
>                 Key: TIKA-79
>                 URL: https://issues.apache.org/jira/browse/TIKA-79
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.1-incubator
>            Reporter: Keith R. Bennett
>             Fix For: 0.1-incubator
>
>         Attachments: AutoDetectParser.patch
>
>
> Unit tests to test the behavior of AutoDetectParser fail when byte header detection is needed.  When correct names of resources and MIME types are passed into the Metadata object, the values below show what was found.  Note that some of the document types have null for typeFromHeader:
> typeFromContentTypeHint = application/vnd.ms-excel
> typeFromResourceName = application/vnd.ms-excel
> typeFromHeader = null
> type = application/vnd.ms-excel
> typeFromContentTypeHint = text/html
> typeFromResourceName = text/html
> typeFromHeader = text/html
> type = text/html
> typeFromContentTypeHint = application/vnd.oasis.opendocument.text
> typeFromResourceName = application/vnd.oasis.opendocument.text
> typeFromHeader = application/vnd.oasis.opendocument.text
> type = application/vnd.oasis.opendocument.text
> typeFromContentTypeHint = application/pdf
> typeFromResourceName = application/pdf
> typeFromHeader = application/pdf
> type = application/pdf
> typeFromContentTypeHint = application/vnd.ms-powerpoint
> typeFromResourceName = application/vnd.ms-powerpoint
> typeFromHeader = null
> type = application/vnd.ms-powerpoint
> log4j:WARN No appenders could be found for logger (root).
> log4j:WARN Please initialize the log4j system properly.
> typeFromContentTypeHint = application/rtf
> typeFromResourceName = application/rtf
> typeFromHeader = null
> type = application/rtf
> typeFromContentTypeHint = text/plain
> typeFromResourceName = text/plain
> typeFromHeader = null
> type = text/plain
> typeFromContentTypeHint = application/msword
> typeFromResourceName = application/msword
> typeFromHeader = null
> type = application/msword
> typeFromContentTypeHint = application/xml
> typeFromResourceName = application/xml
> typeFromHeader = null
> type = application/xml

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Commented: (TIKA-79) Mime type detection from file header appears to be failing.

Posted by "Keith R. Bennett" <kb...@bbsinc.biz>.

Chris -

If I understand correctly, we already have what we need in MimeUtils:
    public String getType(String typeName, String url, byte[] data) { ... }

Jukka, should I modify AutoDetectParser to call this method instead of its
own?

However, the bigger issue is whether the assessment that header-based
detection fails with certain file types is correct.  For example, it fails
to identify the type of the PowerPoint test document we provide.  Do we know
which types can and can't be detected?  If so, it would be helpful to our
users and ourselves to document that information.  I could put something
together based on my observations, but that would risk being incomplete or
incorrect due to different document software versions (e.g. Word).
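
One way to gather that information systematically would be a small harness that runs the header detector over a set of sample documents and records, per sample, whether the type was found. The sketch below is only an illustration of that idea: HeaderDetector, Sample and the PDF_ONLY stand-in detector are hypothetical names invented for the example, not Tika classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness for documenting which types a header detector recognizes. */
final class HeaderSupportReport {

    /** Stand-in for whatever header-based detector is under test (assumption). */
    interface HeaderDetector {
        String detect(byte[] header); // returns null when nothing matches
    }

    /** One sample document: a label, the type we expect, and its leading bytes. */
    static final class Sample {
        final String name;
        final String expectedType;
        final byte[] header;

        Sample(String name, String expectedType, byte[] header) {
            this.name = name;
            this.expectedType = expectedType;
            this.header = header;
        }
    }

    /** Toy detector that only knows the "%PDF-" prefix, used to exercise the harness. */
    static final HeaderDetector PDF_ONLY = new HeaderDetector() {
        @Override
        public String detect(byte[] header) {
            byte[] magic = {'%', 'P', 'D', 'F', '-'};
            if (header == null || header.length < magic.length) {
                return null;
            }
            for (int i = 0; i < magic.length; i++) {
                if (header[i] != magic[i]) {
                    return null;
                }
            }
            return "application/pdf";
        }
    };

    /** Returns one tab-separated line per sample: name, expected type, status. */
    static List<String> report(HeaderDetector detector, List<Sample> samples) {
        List<String> lines = new ArrayList<String>();
        for (Sample sample : samples) {
            String found = detector.detect(sample.header);
            String status;
            if (found == null) {
                status = "NOT DETECTED";
            } else if (found.equals(sample.expectedType)) {
                status = "OK";
            } else {
                status = "MISMATCH (" + found + ")";
            }
            lines.add(sample.name + "\t" + sample.expectedType + "\t" + status);
        }
        return lines;
    }
}
```

Running report(...) against samples produced by several versions of the same application would address the concern about version differences, since each version becomes its own row in the output.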

- Keith

