Posted to dev@tika.apache.org by "Maxim Valyanskiy (JIRA)" <ji...@apache.org> on 2009/07/01 13:24:47 UTC
[jira] Created: (TIKA-257) Uncorrect mime-type detection for ooxml
Uncorrect mime-type detection for ooxml
---------------------------------------
Key: TIKA-257
URL: https://issues.apache.org/jira/browse/TIKA-257
Project: Tika
Issue Type: Bug
Components: general
Affects Versions: 0.4
Reporter: Maxim Valyanskiy
MimeTypes detects docx (and other Office Open XML documents) as 'application/zip' when the file does not have a proper extension:
$ java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar -m /home/maxcom/download-tmp/proto.docx
Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document
resourceName: proto.docx
$ cat /home/maxcom/download-tmp/proto.docx | java -jar tika-app/target/tika-app-0.4-SNAPSHOT.jar -m
Content-Type: application/zip
This breaks text extraction when the file name is not known.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (TIKA-257) Uncorrect mime-type detection for ooxml
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-257.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.4
Assignee: Jukka Zitting
I found a pretty accurate magic byte pattern (the file name string [Content_Types].xml at offset 30) for OOXML files. This still doesn't tell whether the document is a spreadsheet, a presentation or something different, but at least it's enough to allow Tika to correctly send the document to OOXMLParser for more detailed processing with POI.
I added the byte pattern and made some related adjustments in revision 793696. The above test case now passes.
Resolving as Fixed.
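For illustration, the magic byte check described above could be sketched in plain Java roughly as follows. This is not the change made in revision 793696, only a standalone approximation: the class name OoxmlMagicSketch, the Kind enum and the classify method are hypothetical names chosen for this example. Offset 30 is where the first entry's file name begins, because the fixed part of a ZIP local file header is 30 bytes long. The pattern therefore only matches when [Content_Types].xml is the first entry in the archive.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika code: classify a byte prefix as OOXML, plain ZIP or other.
class OoxmlMagicSketch {

    enum Kind { OOXML, ZIP, OTHER }

    // ZIP local file header signature: 'P' 'K' 0x03 0x04
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // The fixed part of a ZIP local file header is 30 bytes; the entry name follows it.
    private static final int NAME_OFFSET = 30;

    private static final byte[] OOXML_NAME =
            "[Content_Types].xml".getBytes(StandardCharsets.US_ASCII);

    // Returns true if 'pattern' occurs in 'data' starting exactly at 'offset'.
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (data.length < offset + pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // Classifies the leading bytes of a stream; 49 bytes are enough for the OOXML check.
    static Kind classify(byte[] prefix) {
        if (!matchesAt(prefix, 0, ZIP_MAGIC)) {
            return Kind.OTHER;
        }
        if (matchesAt(prefix, NAME_OFFSET, OOXML_NAME)) {
            return Kind.OOXML;
        }
        return Kind.ZIP;
    }
}
```

A caller would read the first 49 bytes of the stream (30 bytes of header plus the 19-byte name) and pass them to classify. As noted above, a match says nothing about whether the document is a word processing file, a spreadsheet or a presentation.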