Posted to dev@tika.apache.org by "John Mastarone (Created) (JIRA)" <ji...@apache.org> on 2011/11/21 03:43:51 UTC

[jira] [Created] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Tika CLI --detect returns incorrect content-type for files with altered extensions
----------------------------------------------------------------------------------

                 Key: TIKA-786
                 URL: https://issues.apache.org/jira/browse/TIKA-786
             Project: Tika
          Issue Type: Bug
          Components: cli
    Affects Versions: 1.1
         Environment: Windows
            Reporter: John Mastarone
            Priority: Minor


From a discussion on the user mailing list on Nov. 11, 2011, where the following was requested as a new bug: Tika CLI will return incorrect content type information when called with --detect for files that have had their extensions modified (and nothing else). MS Word (.doc) documents that have their extension changed to .xls or .ppt will be incorrectly detected as Excel or PowerPoint documents, whereas the --metadata option will determine the content type correctly (as application/msword), based on the actual contents of these mis-named files. The same also occurs with other types of MS Office 2003 documents, and could possibly occur with a wide range of document types. To quote Nick B., from the user mailing list: "If you look at the TestMediaTypes class you'll see what you can get with just the mime magic and filenames, and then there's TestContainerAwareDetector which shows the correct detection happening by using the extra detectors available".

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154154#comment-13154154 ] 

Jukka Zitting commented on TIKA-786:
------------------------------------

Cool, looks good. I was simultaneously approaching this from a slightly different angle (see https://github.com/jukka/tika/commit/97a15bdcd79549d3c5147b7b8f9b6f46a9bb8fc5), but your changes look nicer (I like the way you can give preference to non-Tika detectors) so let's go with that.
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154103#comment-13154103 ] 

Nick Burch commented on TIKA-786:
---------------------------------

In r1204435, I've added some failing (and therefore disabled) unit tests for this. If you re-enable the tests on lines 81-83 and 127-129, you'll see this issue.
                

[jira] [Resolved] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Nick Burch (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-786.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Explanation added to CHANGES in r1204479, so I think this is now resolved.
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154110#comment-13154110 ] 

Nick Burch commented on TIKA-786:
---------------------------------

The problem seems to be with how DefaultDetector handles conflicting detection results, which is different to how the previous ContainerAwareDetector did so.

Previously, the logic was to ask the container detectors to review the file. If they had a good match, that was used as the mimetype. Only if the container detectors didn't know would the mime magic+filename detection (provided by MimeTypes) be used.

Under the new DefaultDetector system, this has changed. Instead, each detector is tried in turn, and while detectors are allowed to specialise the type detected for a file, they are not permitted to change it completely (even if a previous detector was wrong).

It looks like this DefaultDetector logic will need to be changed, to allow detectors such as the container ones to override incorrect (typically filename-based) detection.
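
To make the difference between the two behaviours concrete, here is a minimal Java sketch. It is not the real ContainerAwareDetector or DefaultDetector code: the class name, both method names and the tiny type hierarchy are assumptions made up for illustration, and results are passed around as plain strings.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the real ContainerAwareDetector or DefaultDetector code.
class DetectionLogicSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Toy type hierarchy (child -> parent), assumed for illustration only.
    static final Map<String, String> PARENT = Map.of(
            "application/msword", "application/x-tika-msoffice",
            "application/vnd.ms-excel", "application/x-tika-msoffice");

    // True if 'type' is 'ancestor' itself or one of its descendants.
    static boolean isSpecialisationOf(String type, String ancestor) {
        for (String t = type; t != null; t = PARENT.get(t)) {
            if (t.equals(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Previous logic: trust the container detector when it has a match,
    // otherwise fall back to the magic+filename result.
    static String previousLogic(String containerResult, String mimeTypesResult) {
        return OCTET_STREAM.equals(containerResult) ? mimeTypesResult : containerResult;
    }

    // Newer logic: take each result in turn; a later result may only
    // specialise the current answer, never replace it outright.
    static String specialiseOnly(List<String> resultsInOrder) {
        String current = OCTET_STREAM;
        for (String result : resultsInOrder) {
            if (OCTET_STREAM.equals(current) || isSpecialisationOf(result, current)) {
                current = result;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // A Word document renamed to .xls: the filename suggests Excel,
        // while the container contents say Word.
        String byName = "application/vnd.ms-excel";
        String byContainer = "application/msword";
        System.out.println(previousLogic(byContainer, byName));
        System.out.println(specialiseOnly(List.of(byName, byContainer)));
    }
}
```

In this toy model the filename-based answer is seen first, so the container detector's Word result is rejected because it is not a specialisation of the Excel type.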
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154135#comment-13154135 ] 

Jukka Zitting commented on TIKA-786:
------------------------------------

> Do we have any control over the ordering though?

Some. The type database always comes first, which for most use cases should be good enough.

> One situation where the mimetype detection is better is with truncated files.

Right. The good thing about the container detectors is that they only give a result (other than application/octet-stream) if they're really sure about the detection result. So with the proposed reverse detection order, the type database would always be consulted last and could provide a fallback result in case none of the more accurate detectors worked.
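
As a rough illustration of "only answer when sure, otherwise defer", consider the following Java sketch. Everything in it is hypothetical: the class and method names are invented, the container detector is reduced to a check on the names of top-level OLE2 entries, and the type database is reduced to an extension lookup. An empty entry set stands in for a truncated or unreadable file.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; the names and the simplified checks are assumptions.
class FallbackSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Toy container detector: looks at the names of the top-level entries
    // of an OLE2 file and answers only when it recognises one of them.
    static String detectByContainer(Set<String> entryNames) {
        if (entryNames.contains("WordDocument")) {
            return "application/msword";
        }
        if (entryNames.contains("Workbook")) {
            return "application/vnd.ms-excel";
        }
        if (entryNames.contains("PowerPoint Document")) {
            return "application/vnd.ms-powerpoint";
        }
        return OCTET_STREAM; // not sure, so give no opinion
    }

    // Toy stand-in for the type database: extension lookup only.
    static String detectByName(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) {
            return "application/msword";
        }
        if (lower.endsWith(".xls")) {
            return "application/vnd.ms-excel";
        }
        if (lower.endsWith(".ppt")) {
            return "application/vnd.ms-powerpoint";
        }
        return OCTET_STREAM;
    }

    // Accurate detector first; the name-based lookup is consulted last
    // and only supplies a fallback.
    static String detect(Set<String> entryNames, String fileName) {
        String byContainer = detectByContainer(entryNames);
        return OCTET_STREAM.equals(byContainer) ? detectByName(fileName) : byContainer;
    }

    public static void main(String[] args) {
        // Word contents under an .xls name: the container result wins.
        System.out.println(detect(Set.of("WordDocument"), "report.xls"));
        // Nothing readable in the container: fall back to the name.
        System.out.println(detect(Set.of(), "report.xls"));
    }
}
```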
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Jukka Zitting (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154127#comment-13154127 ] 

Jukka Zitting commented on TIKA-786:
------------------------------------

Hmm, I didn't think of such a case when doing the DefaultDetector logic. My idea was that more accurate container detectors would just refine a more generic detection result from the basic detectors that are always run first. In this case, though, the basic detector ends up giving wrong results, which breaks my logic.

Since the container detectors practically always give correct results, I guess it's fine to always use their results. Or perhaps even better, we could check the detectors in reverse order, so that the most accurate detection result is used as the starting point, and less accurate detection based on things like the file name could only refine that result to a more specific media type.
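
The reverse-order idea could look roughly like the Java sketch below. It is only an illustration under stated assumptions: the class and method names are invented, detector results are plain strings listed in the original (basic detectors first) order, and the two-level type hierarchy is a toy, not Tika's real media type registry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of reverse-order detection; not Tika's actual code.
class ReverseOrderSketch {
    static final String OCTET_STREAM = "application/octet-stream";

    // Toy type hierarchy (child -> parent), assumed for illustration only.
    static final Map<String, String> PARENT = Map.of(
            "application/msword", "application/x-tika-msoffice",
            "application/vnd.ms-excel", "application/x-tika-msoffice");

    // True if 'candidate' is 'current' itself or a more specific subtype of it.
    static boolean refines(String candidate, String current) {
        for (String t = candidate; t != null; t = PARENT.get(t)) {
            if (t.equals(current)) {
                return true;
            }
        }
        return false;
    }

    // Takes results in the original order (basic detectors first), walks them
    // in reverse so the most accurate result becomes the starting point, and
    // lets less accurate results only refine it.
    static String detectReversed(List<String> resultsBasicFirst) {
        List<String> reversed = new ArrayList<>(resultsBasicFirst);
        Collections.reverse(reversed);
        String current = OCTET_STREAM;
        for (String result : reversed) {
            if (OCTET_STREAM.equals(current) || refines(result, current)) {
                current = result;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Word document renamed to .xls: the container result is kept.
        System.out.println(detectReversed(
                List.of("application/vnd.ms-excel", "application/msword")));
        // Container detector only knows "some OLE2 file": the name may refine it.
        System.out.println(detectReversed(
                List.of("application/vnd.ms-excel", "application/x-tika-msoffice")));
    }
}
```

In this model a filename-based result can never contradict a confident container result, but it can still narrow down a generic one.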
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154130#comment-13154130 ] 

Nick Burch commented on TIKA-786:
---------------------------------

Do we have any control over the ordering though? My hunch is that user-supplied detectors should probably be used in preference to Tika ones, and the parser-based detectors in Tika should be used in preference to the MimeTypes one.

One situation where the mimetype detection is better is with truncated files. Here the container detector can only say "looks like one of mine, can't tell you any more", while the mimetype one can use the filename to fill in the rest. I've a feeling that at least some people pass in only the first few KB of files for detection, to ensure it's fast, so their use case would want the filename-based MimeTypes detector logic to kick in to specialise.
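
A preference order like the one suggested above could be expressed as a simple ranking. The Java sketch below is an assumption-laden illustration, not what DefaultDetector does: it ranks detectors purely by class name, treats anything outside the org.apache.tika package as user-supplied, and uses com.example.MyDetector as a made-up third-party detector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of ordering detectors by preference.
class DetectorOrderSketch {
    // Lower rank is consulted first.
    static int rank(String className) {
        if (!className.startsWith("org.apache.tika.")) {
            return 0; // user-supplied detector
        }
        if (className.equals("org.apache.tika.mime.MimeTypes")) {
            return 2; // mime magic + filename detection
        }
        return 1; // Tika's own parser-based (container) detectors
    }

    // Returns a sorted copy; List.sort is stable, so detectors with the
    // same rank keep their relative order.
    static List<String> order(List<String> classNames) {
        List<String> sorted = new ArrayList<>(classNames);
        sorted.sort(Comparator.comparingInt(DetectorOrderSketch::rank));
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(order(List.of(
                "org.apache.tika.mime.MimeTypes",
                "org.apache.tika.parser.microsoft.POIFSContainerDetector",
                "com.example.MyDetector")));
    }
}
```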
                

[jira] [Commented] (TIKA-786) Tika CLI --detect returns incorrect content-type for files with altered extensions

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13154145#comment-13154145 ] 

Nick Burch commented on TIKA-786:
---------------------------------

I've had a go at solving this in r1204476, by having DefaultDetector order the detectors differently, based on the discussions here. (The reversing is done in DefaultDetector, rather than in CompositeDetector, as that seems to make more sense to me.)

This has allowed me to enable the previously failing tests for this issue, and all other tests still pass.
                