Posted to dev@tika.apache.org by "Petr Pytelka (JIRA)" <ji...@apache.org> on 2013/04/29 14:20:22 UTC

[jira] [Created] (TIKA-1116) Wrong detection of XLS/Doc fil

Petr Pytelka created TIKA-1116:
----------------------------------

             Summary: Wrong detection of XLS/Doc fil
                 Key: TIKA-1116
                 URL: https://issues.apache.org/jira/browse/TIKA-1116
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.3, 1.4
            Reporter: Petr Pytelka


My issue:
I have a valid XLS file, but Tika detects it as DOC.

Cause:
tika-mimetypes.xml contains these lines:

  <mime-type type="application/msword">
..
      <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
..
  </mime-type>

According to the Microsoft documentation, this prefix is the header signature of every Compound Binary file (DOC, XLS, PPT and others), so it cannot identify a Word document on its own.
See section 2.1 (Header) of the specification: http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf
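
To illustrate the point, here is a minimal Java sketch that checks a file for this signature. It is not Tika code: the class name Ole2Signature and the method hasOle2Signature are hypothetical, chosen only for this example. The octal escapes in the match rule are the bytes D0 CF 11 E0 A1 B1 1A E1, and the check should come out the same for a DOC, an XLS, or a PPT file, since all of them are Compound Binary files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
class Ole2Signature {
    // The match value \320\317\021\340\241\261\032\341 (octal escapes)
    // written out as hex bytes.
    static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the buffer starts with the 8-byte Compound Binary signature.
    public static boolean hasOle2Signature(byte[] header) {
        if (header == null || header.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(header, MAGIC.length), MAGIC);
    }

    // Usage: java Ole2Signature report.xls letter.doc slides.ppt
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] head = new byte[MAGIC.length];
            int read;
            try (InputStream in = Files.newInputStream(Paths.get(name))) {
                // readNBytes requires Java 9 or later.
                read = in.readNBytes(head, 0, head.length);
            }
            boolean match = read == MAGIC.length && hasOle2Signature(head);
            System.out.println(name + ": Compound Binary signature = " + match);
        }
    }
}
```

A match at offset 0 therefore only tells us that the file is some Compound Binary container. Deciding between DOC and XLS needs information from inside the container, which this sketch does not attempt.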

I propose removing the line
      <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
from tika-mimetypes.xml.

