Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2022/03/14 14:04:00 UTC
[jira] [Comment Edited] (TIKA-3700) DefaultZipContainerDetector fails to recognize .docx file

    [ https://issues.apache.org/jira/browse/TIKA-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506248#comment-17506248 ] 

Tim Allison edited comment on TIKA-3700 at 3/14/22, 2:03 PM:
-------------------------------------------------------------

Thank you for opening this issue.  My initial thought was that there might be a difference between detecting from a file and detecting from an input stream. I just tested that, at least on main, and I'm consistently getting *.wordprocessingml.document on our test .docx files.

There's one place where there's a difference between a stream and a file, and that's with our truncated docx file.  From a stream, it is detected as *.wordprocessingml.document, but from a file (when I do not include the file name in the metadata), it's detected as tika-ooxml.

I wonder if the corrupted (?) file is causing an exception during detection which leads to different results?

This is definitely weird.  I have no idea how your test file is getting corrupted.

Is the test file corrupted in the same way with 2.2.1, and 2.2.1 is just more resilient?  Or is the file not getting corrupted with 2.2.1?

Are you seeing this behavior in your application?

How are you opening your inputstream?  TikaInputStream.get(Path) or TikaInputStream.get(Path, Metadata) or something else?

Can you share the corrupt file with us?
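
To check whether the file really differs between the two runs, a byte-level comparison may help. This is only a rough sketch using the plain JDK (nothing from Tika); the class name ByteDiff is made up, and the two paths are assumptions about where the original and the copy read by the test might live (for example src/test/resources and target/test-classes in a standard Maven layout).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: finds where two copies of a file diverge.
final class ByteDiff {

    // Returns the offset of the first differing byte, or -1 if the arrays are identical.
    // If one array is a strict prefix of the other, returns the length of the shorter one.
    static long firstMismatch(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : common;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: the source file (assumed location, e.g. under src/test/resources)
        // args[1]: the copy the test actually reads (assumed, e.g. under target/test-classes)
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] copy = Files.readAllBytes(Paths.get(args[1]));
        long offset = firstMismatch(original, copy);
        System.out.println("sizes: " + original.length + " vs " + copy.length);
        System.out.println(offset < 0 ? "identical" : "first difference at byte " + offset);
    }
}
```

If the sizes or bytes differ, the file is being modified somewhere between the source tree and the test run, and the offset shows where to start looking.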



was (Author: tallison@mitre.org):
Thank you for opening this issue.  My initial thought was that maybe there's a difference with detecting from a file vs an inputstream; and I just tested that now at least on main, and I'm consistently getting *.wordprocessingml.document on our test .docx files.

There's one place where there's a diff between a stream and a file, and that's in our truncated docx file.  If there's a stream, it is detected as *.wordprocessingml.document, but if it is a file, then it's detected as tika-ooxml.

I wonder if the corrupted (?) file is causing an exception during detection which leads to different results?

This is definitely weird.  I have no idea how your test file is getting corrupted.

Is the test file corrupted the same with 2.2.1 and 2.2.1 is more resilient?  Or is the file not getting corrupted with 2.2.1?

Are you seeing this behavior in your application?

How are you opening your inputstream?  TikaInputStream.get(Path) or TikaInputStream.get(Path, Metadata) or something else?

Can you share the corrupt file with us?


> DefaultZipContainerDetector fails to recognize .docx file
> ---------------------------------------------------------
>
>                 Key: TIKA-3700
>                 URL: https://issues.apache.org/jira/browse/TIKA-3700
>             Project: Tika
>          Issue Type: Bug
>          Components: detector
>    Affects Versions: 2.3.0
>         Environment: Ubuntu + mvn 3.6.3 + java 8
>            Reporter: Michał Ruszkowski
>            Priority: Major
>
> Hello,
> Recently my team upgraded from Tika 1.x to 2.3 due to a vulnerability, and I noticed a problem with file type detection based on content.
>  * we have a simple test that calls the method 
> {code:java}
> tika.getDetector().detect(tikaInputStream, metadata);{code}
>  * the file that we create the inputStream from is placed inside _/test/resources_ and it is a *.docx*
>  * the detector method DefaultZipContainerDetector.detect() returns application/x-tika-ooxml when we run mvn install
>  * the same test was working with Tika 1.x
>  * we have the dependencies _*tika-core*_ and _*tika-parsers-standard-package*_ in pom.xml
> The strangest part is that the same test runs successfully through IntelliJ's 'Run Test...' button.
>  * I tried using UTF-8 encoding in Maven's pom.xml as well as passing the parameter -Dfile.encoding=UTF-8 during install, with no success.
>  * I compared the content of the files in both cases (the successful test and the failed one) and they look almost the same; however, in one case the whitespace seems to be larger. I don't know if it can make a difference, but here is example content of the file that is properly detected: 
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&� q(=X�� ��!.���,�_�WF�L8W()���u{code}
>  
> and here is the same line of content from the file that fails (notice the additional whitespace before 'q(='):
> {code:java}
> �l�������:0Tɭ�"Э�p'䧘 ��tn��&�  q(=X�� ��!.���,�_�WF�L8W()���u {code}
>  * I just checked and it works fine with Tika 2.2.1


