Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2013/04/29 15:00:17 UTC

[jira] [Commented] (TIKA-1116) Wrong detection of XLS/Doc fil

    [ https://issues.apache.org/jira/browse/TIKA-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644466#comment-13644466 ] 

Nick Burch commented on TIKA-1116:
----------------------------------

Office file formats can't be detected with 100% accuracy using mime magic alone. If you want that accuracy, you need to allow the use of POIFSContainerDetector, which works out the type from the actual contents of the container.
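
To illustrate why the magic alone can't tell these formats apart, here is a minimal standalone sketch in plain Java. It is not Tika code: the class name Ole2Signature and the method matches are made up for this example. It checks only for the eight signature bytes from the quoted match rule (\320\317\021\340\241\261\032\341 in octal, D0 CF 11 E0 A1 B1 1A E1 in hex). A positive result says only that the data looks like an OLE2 compound file, which is equally true of DOC, XLS and PPT.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Tika or POI. */
final class Ole2Signature {
    // The eight header bytes from the quoted match rule,
    // written there in octal as \320\317\021\340\241\261\032\341.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    private Ole2Signature() {}

    /**
     * Returns true if the data starts with the OLE2 signature.
     * This cannot distinguish DOC from XLS or PPT: all of them
     * share the same first eight bytes.
     */
    static boolean matches(byte[] head) {
        return head != null
            && head.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }
}
```

Calling Ole2Signature.matches on the first bytes of a DOC file and of an XLS file returns true in both cases, so telling them apart means looking inside the container, as described above.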
                
> Wrong detection of XLS/Doc fil
> ------------------------------
>
>                 Key: TIKA-1116
>                 URL: https://issues.apache.org/jira/browse/TIKA-1116
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.3, 1.4
>            Reporter: Petr Pytelka
>              Labels: DOC,, XLS
>
> My issue:
> I have a valid XLS file that is detected as DOC.
> Cause:
> tika-mimetypes.xml contains the lines:
>   <mime-type type="application/msword">
> ..
>       <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> ..
>   </mime-type>
> According to the MS documentation, this prefix can appear in any Compound Binary file (DOC, XLS, PPT and others).
> The documentation is here: http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/WindowsCompoundBinaryFileFormatSpecification.pdf (see section 2.1, Header)
> My proposal is to remove the line
>       <match value="\320\317\021\340\241\261\032\341" type="string" offset="0"/>
> from tika-mimetypes.xml.
