Posted to dev@tika.apache.org by "Cservenak, Tamas (JIRA)" <ji...@apache.org> on 2014/05/22 16:03:02 UTC

[jira] [Comment Edited] (TIKA-1292) Inconsistent priorities in bundled tika-mimetypes.xml

    [ https://issues.apache.org/jira/browse/TIKA-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005930#comment-14005930 ] 

Cservenak, Tamas edited comment on TIKA-1292 at 5/22/14 2:01 PM:
-----------------------------------------------------------------

My test project at https://github.com/cstamas/tika-1292, run against a locally built Tika 1.6-SNAPSHOT (built off r1596612), passes just fine.
I also tried it with the r1596590 change locally reverted (the tika-mimetypes.xml change that raises the priority of ZIP), to prove that the new code works as expected: all is fine.
This issue can be closed as "fixed".


was (Author: cstamas):
My test project with Tika 1.6-SNAPSHOT (built off r1596612) passes just fine.
Also, tried with the r1596590 change locally reverted (the tika-mimetypes.xml change that ups priority of ZIP), to prove that new code works as expected: all is fine.
This issue can be closed as "fixed".

> Inconsistent priorities in bundled tika-mimetypes.xml
> -----------------------------------------------------
>
>                 Key: TIKA-1292
>                 URL: https://issues.apache.org/jira/browse/TIKA-1292
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.5
>            Reporter: Cservenak, Tamas
>
> It seems that MIME type priorities are a bit inconsistent in the tika-mimetypes.xml bundled with tika-core.
> A few examples:
> * [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497] vs [application/x-7z-compressed|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3510]: both are similar "container" archive formats (structured, having entries) with distinct file extensions ("zip" vs "7z" globs), yet their priorities are 40 and 50 respectively.
> * [application/zip|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3497] vs [text/html|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L4713]: these are unrelated MIME types, yet both have the same priority of 40. But ZIP files can be "uncompressed" (meaning entries are mostly "concatenated", and their content, if plaintext, is readable). Hence an "uncompressed" ZIP file (or any subclass like JAR) that contains HTML files may be detected as HTML, which is wrong.
> And this is what happens in Nexus, which uses Tika under the hood for "content" validation, basically using the MIME magic detection provided by the Tika Detector: the Java JAR {{com.intellij:annotations:7.0.3}} ([link|http://repo1.maven.org/maven2/com/intellij/annotations/7.0.3/]) is detected as {{text/html}} instead of the expected {{application/java-archive}}.
> The reason is as follows: the JAR file is packed in the "uncompressed" ZIP format, and among a few annotations it also contains one HTML file entry (the license, I guess). Since both MIME types have the same priority (40), I guess Tika "randomly" chooses {{text/html}}.
> Original Nexus issue
> https://issues.sonatype.org/browse/NEXUS-6560
> The Nexus issue links a GitHub pull request that solves the problem for us (by raising the {{application/zip}} priority to 41).
> But while inspecting the bundled tika-mimetypes.xml we spotted other probable priority inconsistencies, such as that of zip vs 7z mentioned above.
> Note: this happens when tika-core alone is on the classpath and is used for MIME magic detection. Interestingly, when tika-parsers (with all its dependencies) is added to the classpath, Tika properly figures out that the artifact is {{application/java-archive}}. Still, our use case in Nexus requires MIME magic detection only, so we do not use tika-parsers, nor would we like to.
> Sample project to reproduce
> https://github.com/cstamas/tika-1292
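
The point above about "uncompressed" ZIP files can be illustrated without Tika at all. The following is only a minimal sketch using the JDK's java.util.zip; the class name StoredZipDemo, the helper names, and the entry name license.html are made up for illustration and are not taken from the sample project or from Nexus. It writes a single HTML entry with the STORED method and checks that the archive still starts with the ZIP "PK" signature while the HTML markup remains readable in the raw bytes, which is the kind of input where a ZIP magic and an HTML magic can plausibly both match.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Tika or of the tika-1292 sample project.
public class StoredZipDemo {

    // Builds an in-memory ZIP holding one entry written with the STORED
    // (uncompressed) method. STORED entries need size and CRC set up front.
    public static byte[] storedZip(String entryName, byte[] content) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(content);

        ZipEntry entry = new ZipEntry(entryName);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(content.length);
        entry.setCompressedSize(content.length);
        entry.setCrc(crc.getValue());

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    // Returns true if the ASCII text occurs anywhere in the raw bytes.
    // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost.
    public static boolean containsAscii(byte[] data, String needle) {
        return new String(data, StandardCharsets.ISO_8859_1).contains(needle);
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>license</body></html>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] zip = storedZip("license.html", html);

        // The archive begins with the ZIP local file header signature "PK"...
        System.out.println("starts with PK: " + (zip[0] == 'P' && zip[1] == 'K'));
        // ...yet the HTML markup is still present verbatim in the raw bytes.
        System.out.println("contains <html: " + containsAscii(zip, "<html"));
    }
}
```

This only shows what the bytes look like; how Tika orders or breaks ties between matching magics is not modelled here.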



--
This message was sent by Atlassian JIRA
(v6.2#6252)