Posted to dev@tika.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2021/11/23 18:32:00 UTC

[jira] [Commented] (TIKA-3596) Detect truncated/bad encoded XML files as application/xml instead of text/plain

    [ https://issues.apache.org/jira/browse/TIKA-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448192#comment-17448192 ] 

Hudson commented on TIKA-3596:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-branch1x-jdk8 #148 (See [https://ci-builds.apache.org/job/Tika/job/tika-branch1x-jdk8/148/])
TIKA-3596: detect truncated/bad encoded xml files as application/xml (lfcnassif: [https://github.com/apache/tika/commit/92f10558a019b2a386a41e5fac745a9e05cd3e1b])
* (edit) tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
* (add) tika-core/src/test/resources/org/apache/tika/mime/truncated-utf16-xml.xyz
* (edit) tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java


> Detect truncated/bad encoded XML files as application/xml instead of text/plain
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-3596
>                 URL: https://issues.apache.org/jira/browse/TIKA-3596
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector
>    Affects Versions: 1.27, 2.1.0
>            Reporter: Luís Filipe Nassif
>            Assignee: Luís Filipe Nassif
>            Priority: Minor
>             Fix For: 1.27.1, 2.1.1
>
>         Attachments: test.xyz
>
>
> The MimeTypes class contains logic that returns text/plain for corrupted XML files that are not detected as text/html: https://github.com/apache/tika/blob/324f2f2ccff21c608969e2e79da88e71379a58dc/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java#L281
> I think this should be changed to return application/xml even when the file is corrupted, as is done for all other MIME types, which would be more consistent across file formats. A corrupted jpg or doc file is still reported as image/jpeg or application/msword.
> About 2k of the ~90k XML files in an internal corpus trigger this.
> If other devs agree, I can submit a patch and a unit test.
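
The behaviour requested above can be illustrated with a minimal, JDK-only sketch. It is not Tika's implementation, and the actual change to XmlRootExtractor is not shown in this message. The class name XmlFallbackSketch and the methods sniffCharset and detect are hypothetical, as is the simplification of deciding on the XML declaration alone. The sketch keeps reporting application/xml for a truncated UTF-16 document instead of falling back to text/plain.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Tika.
public class XmlFallbackSketch {

    // Choose a charset from a UTF-16 byte-order mark; assume UTF-8 otherwise.
    static Charset sniffCharset(byte[] b) {
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return StandardCharsets.UTF_8;
    }

    // Report application/xml whenever the data starts with an XML declaration,
    // without requiring the rest of the document to be complete or parseable.
    static String detect(byte[] data) {
        String text = new String(data, sniffCharset(data));
        // The decoders used here keep the BOM as U+FEFF, so drop it manually.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        return text.trim().startsWith("<?xml") ? "application/xml" : "text/plain";
    }

    public static void main(String[] args) {
        String xml = "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?><root><child>tru";
        byte[] full = xml.getBytes(StandardCharsets.UTF_16LE);
        // Cut the last byte so the file ends in the middle of a UTF-16 code unit.
        byte[] truncated = Arrays.copyOf(full, full.length - 1);

        System.out.println(detect(truncated));
        System.out.println(detect("just some text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

A real detector would also have to handle documents without an XML declaration and other encodings; the sketch only shows the fallback decision described in the issue.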



--
This message was sent by Atlassian Jira
(v8.20.1#820001)