Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/09/15 13:43:00 UTC

[jira] [Commented] (TIKA-3556) DefaultZipContainerDetector returns application/zip for .odt files when OPCPackageDetector is on the classpath

    [ https://issues.apache.org/jira/browse/TIKA-3556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415521#comment-17415521 ] 

Tim Allison commented on TIKA-3556:
-----------------------------------

Able to reproduce this. I'm surprised our unit tests didn't catch this.  Adding new ones and fixing.  Thank you for opening this issue and diagnosing the bugs!

> DefaultZipContainerDetector returns application/zip for .odt files when OPCPackageDetector is on the classpath
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3556
>                 URL: https://issues.apache.org/jira/browse/TIKA-3556
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.1.0
>            Reporter: Simon Gaeremynck
>            Priority: Major
>
> This is happening because the OPCPackageDetector.detect method will [fail and close the underlying zip stream|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java#L257]. When the next detector runs (e.g. OpenDocumentDetector), the stream it receives has been closed and it won't be able to detect anything.
> After all detectors have effectively no-oped, [the DefaultZipContainerDetector falls back to application/zip|https://github.com/apache/tika/blob/2.1.0-rc2/tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/src/main/java/org/apache/tika/detect/zip/DefaultZipContainerDetector.java#L209].
> Now, when running with the default CompositeDetector, the next detector is usually the MimeTypes detector. This returns the proper application/vnd.oasis.opendocument.text, but the [CompositeDetector will ignore|https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java#L86] it as that mime type isn't marked up as a subclass of application/zip in [the registry|https://github.com/apache/tika/blob/2.1.0-rc2/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L2327].
>  
> In short, I think there are two bugs here potentially:
>  # Either the OPCPackageDetector shouldn't close the zip while detecting, or the DefaultZipContainerDetector should re-open it if necessary?
>  # The registry should be updated to mark up application/vnd.oasis.opendocument.text as a subclass of application/zip?
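
For anyone following along, the chain of events described above can be sketched roughly as follows. This is a toy model in plain Java, not Tika code: SharedZip, ZipDetector, OPC_LIKE, ODF_LIKE, detectZipContainer, and composite are all hypothetical names made up for illustration, and the "registry" is just a map from a type to its supertype. It assumes the behavior reported in the issue: one detector closes the shared zip when it fails, later detectors then cannot read it, and a later result only replaces an earlier one when it is a strict specialization of it.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DetectorChainSketch {
    static final String OCTET = "application/octet-stream";
    static final String ZIP = "application/zip";
    static final String ODT = "application/vnd.oasis.opendocument.text";

    /** Toy stand-in for a zip handle that several detectors share. */
    static final class SharedZip {
        private final Set<String> entries;
        private boolean closed;

        SharedZip(Set<String> entries) { this.entries = entries; }

        boolean hasEntry(String name) throws IOException {
            if (closed) throw new IOException("zip already closed");
            return entries.contains(name);
        }

        void close() { closed = true; }
    }

    /** Returns a media type string, or null when the detector has no opinion. */
    interface ZipDetector { String detect(SharedZip zip); }

    // Hypothetical OPC-style detector: closes the shared zip when it finds nothing.
    static final ZipDetector OPC_LIKE = zip -> {
        try {
            if (zip.hasEntry("[Content_Types].xml")) return "application/x-tika-ooxml";
        } catch (IOException e) {
            return null;
        }
        zip.close(); // models the reported problem: later detectors see a closed zip
        return null;
    };

    // Hypothetical ODF-style detector: silently gives up if the zip is unreadable.
    static final ZipDetector ODF_LIKE = zip -> {
        try {
            return zip.hasEntry("mimetype") ? ODT : null;
        } catch (IOException e) {
            return null;
        }
    };

    /** First non-null answer wins; otherwise fall back to application/zip. */
    static String detectZipContainer(List<ZipDetector> detectors, SharedZip zip) {
        for (ZipDetector d : detectors) {
            String type = d.detect(zip);
            if (type != null) return type;
        }
        return ZIP;
    }

    /** True if 'current' is a strict ancestor of 'candidate' in the toy registry. */
    static boolean isSpecializationOf(String candidate, String current,
                                      Map<String, String> superTypes) {
        for (String t = superTypes.get(candidate); t != null; t = superTypes.get(t)) {
            if (t.equals(current)) return true;
        }
        return false;
    }

    /** A later result replaces the current one only if it specializes it. */
    static String composite(List<String> detected, Map<String, String> superTypes) {
        String type = OCTET;
        for (String d : detected) {
            if (isSpecializationOf(d, type, superTypes)) type = d;
        }
        return type;
    }

    public static void main(String[] args) {
        SharedZip odt = new SharedZip(Set.of("mimetype", "content.xml"));
        String container = detectZipContainer(List.of(OPC_LIKE, ODF_LIKE), odt);
        System.out.println(container); // application/zip

        Map<String, String> asReported = Map.of(ZIP, OCTET, ODT, OCTET);
        Map<String, String> withSubclass = Map.of(ZIP, OCTET, ODT, ZIP);
        System.out.println(composite(List.of(container, ODT), asReported));   // application/zip
        System.out.println(composite(List.of(container, ODT), withSubclass)); // ...opendocument.text
    }
}
```

In this sketch either change is enough to get the .odt type back: dropping the close() call lets the ODF-style detector answer directly, and registering the .odt type under application/zip lets the later, more specific answer win.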



--
This message was sent by Atlassian Jira
(v8.3.4#803005)