Posted to dev@tika.apache.org by "Antoni Mylka (JIRA)" <ji...@apache.org> on 2010/08/19 20:13:17 UTC

[jira] Created: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

ContainerAwareDetector doesn't support non-MSOffice files which use the same magic
----------------------------------------------------------------------------------

                 Key: TIKA-486
                 URL: https://issues.apache.org/jira/browse/TIKA-486
             Project: Tika
          Issue Type: Improvement
            Reporter: Antoni Mylka


Many applications besides Microsoft Office write files that start with the MSOffice magic number. I know of Corel Presentations X3, Corel Quattro Pro 7 and X3, and Microsoft Works Word Processor. Each has its own MIME type.

POI doesn't properly support them, though. So when the ContainerAwareDetector encounters such a file, it resorts to the POIFSContainerDetector and returns the basic application/x-tika-msoffice type, because POI can't say anything more specific. This happens even when the fallback detector might come up with a better answer.

That's why, IMHO, the fallback detector should be consulted whenever the POIFSContainerDetector returns x-tika-msoffice. If the fallback detector comes up with a more specific type, that type should be used.
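
The proposed rule can be sketched in a few lines of standalone Java. This is only an illustration of the decision logic, not the attached patch and not Tika's API: the class FallbackChoiceSketch, the method choose, and the example MIME strings other than application/x-tika-msoffice are assumptions made for the example. It also simplifies "more specific" to "anything other than the generic or unknown type".

```java
import java.util.function.Supplier;

public class FallbackChoiceSketch {
    // Generic type returned when the file is OLE2 but nothing more is known.
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";
    // Assumed marker for "no idea"; treated as less specific than anything else.
    static final String UNKNOWN = "application/octet-stream";

    /**
     * Hypothetical helper: keep the container detector's answer unless it is
     * only the generic OLE2 type, in which case ask the fallback detector.
     */
    static String choose(String containerType, Supplier<String> fallback) {
        if (!GENERIC_OLE2.equals(containerType)) {
            return containerType; // already specific, no fallback needed
        }
        String fallbackType = fallback.get();
        if (fallbackType == null
                || UNKNOWN.equals(fallbackType)
                || GENERIC_OLE2.equals(fallbackType)) {
            return containerType; // fallback had nothing better to offer
        }
        return fallbackType; // fallback found a more specific type
    }

    public static void main(String[] args) {
        // Container detector only knows "some OLE2 file"; fallback is more specific.
        System.out.println(choose(GENERIC_OLE2, () -> "application/x-quattro-pro"));
        // Container detector is already specific; fallback is never consulted.
        System.out.println(choose("application/msword", () -> "text/plain"));
        // Fallback knows nothing; the generic type is kept.
        System.out.println(choose(GENERIC_OLE2, () -> UNKNOWN));
    }
}
```

Passing the fallback as a Supplier means the second detector only runs when the first answer is the generic one, which matters if the fallback has to re-read the stream.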

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

Posted by "Antoni Mylka (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated TIKA-486:
------------------------------

    Attachment: tika-non-office-files-with-office-magic.patch
                test-documents.zip

Four test documents (to be placed in test-documents) and a patch.


[jira] Commented: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902369#action_12902369 ] 

Nick Burch commented on TIKA-486:
---------------------------------

Thanks for the patch and the files. It's true we hadn't considered non-Microsoft OLE2 files.

The patch might want a few small tweaks to comments and if-statement ordering to make it clearer what's going on, but the basic logic looks sound. I'll apply it with those tweaks in a few days, assuming no one beats me to it!


[jira] Commented: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906524#action_12906524 ] 

Nick Burch commented on TIKA-486:
---------------------------------

Thinking about it some more, these non-Microsoft files which use OLE2 are going to be just as tricky to spot reliably with magic number detection alone. As with the Microsoft formats, you can't predict where in the OLE2 file the key blocks will fall, so the format-specific magic numbers could be anywhere and are very hard to find.

I think the real solution is to update the OLE2 container-aware detector to know about the entries in these files, so it can handle them correctly. I'm going to go ahead and do this shortly.
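
As a rough sketch of what "knowing about the entries" could look like: once the OLE2 directory has been read, the detector can match on the names of the top-level entries instead of hunting for byte patterns at unpredictable offsets. The code below is standalone Java and does not use POI or Tika. The class EntryNameSketch and its detect method are hypothetical, and the Quattro Pro and Works entry names and MIME strings are assumptions for illustration that would need checking against the sample files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class EntryNameSketch {
    // Generic type used when the container is OLE2 but no known entry matches.
    static final String GENERIC_OLE2 = "application/x-tika-msoffice";

    /**
     * Hypothetical helper: map the top-level entry names of an OLE2 container
     * to a MIME type. The names for the non-Microsoft formats are assumed
     * values for this example, not taken from the thread.
     */
    static String detect(Set<String> topLevelEntries) {
        if (topLevelEntries.contains("WordDocument")) {
            return "application/msword";
        }
        if (topLevelEntries.contains("NativeContent_MAIN")) { // assumed entry name
            return "application/x-quattro-pro";
        }
        if (topLevelEntries.contains("MatOST")) { // assumed entry name
            return "application/vnd.ms-works";
        }
        return GENERIC_OLE2; // OLE2, but nothing we recognise
    }

    public static void main(String[] args) {
        Set<String> word = new HashSet<>(Arrays.asList("WordDocument", "1Table"));
        Set<String> mystery = new HashSet<>(Arrays.asList("SomethingElse"));
        System.out.println(detect(word));
        System.out.println(detect(mystery));
    }
}
```

Because the lookup works on entry names, it does not matter where in the file the corresponding blocks were written.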


[jira] Resolved: (TIKA-486) ContainerAwareDetector doesn't support non-MSOffice files which use the same magic

Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Burch resolved TIKA-486.
-----------------------------

         Assignee: Nick Burch
    Fix Version/s: 0.8
       Resolution: Fixed

Thanks for the sample files. I've added basic MIME type entries for them in r993098.

In r993108, I've also added detection support for them to the OLE2 container detector. I also added some logic to the parent that should help in the unknown case. I think this covers the case you found, but in a more general way across all container detectors.
